It is impossible to talk about human cognition without talking about concepts:
there simply is no human cognition without concepts. Concepts form an abstraction 
of reality that is central to the functioning of the human mind. Conceptual knowledge 
(of e.g., APPLE, LOVE and BEFORE) is crucial for us to categorize, understand, 
and reason about the world. Only equipped with concepts and words for them can we 
successfully communicate and carry out actions. But what exactly are concepts? How
are concepts acquired? How does the human mind use concepts? Such questions have
been a subject of discussion since antiquity and remain highly relevant in multiple 
fields (e.g., Murphy 2002; Margolis and Laurence 2015). 

Recent decades have seen fruitful results and methodological advances on concept 
research in disciplines such as linguistics, philosophy, psychology, artificial intelli- 
gence, and computer science. For instance, cognitive psychologists use empirical 
experiments to validate formal models of concept representation and learning such 
as the prototype theory (Rosch et al. 1976), the exemplar theory (Murphy 2016) or 
other alternative theories (Rogers and McClelland 2004; Blouw et al. 2016). Linguists 
pursue the goal of assigning more precise meaning to natural language expressions 
by mainly applying logic-based formalisms (Asher 2011). In machine learning, deci- 
sion boundaries in high-dimensional feature spaces are used to define membership 
to a concept (Mitchell 1997). Moreover, researchers in the semantic web area have 
created large ontologies (Gómez-Pérez et al. 2004) containing hierarchies of concepts
formulated in description logics. Google’s “Knowledge Graph” illustrates how
such ontologies can be used in industrial applications. 

Despite this plethora of research, there remain many open questions, unresolved
debates and methodological challenges. For instance, the ontologies of the semantic 
web have been challenged as being unable to represent information about conceptual 
similarity and thus as being ill-suited for representing conceptual knowledge (Gärdenfors
2004). And although deep learning models are often said to acquire concepts
when learning to classify pictures of dogs, umbrellas, and other objects, they can be 
easily fooled by slightly manipulated input images (Szegedy et al. 2013)—which 
highlights that they only learn patterns, but no conceptual knowledge. 

One major obstacle for a better and more holistic understanding of concepts 
is that research on concepts has usually been carried out in different disciplines 
individually—with different approaches, different goals, and different results. The 
multi-disciplinary research efforts usually run in parallel without enough interaction; 
existing interdisciplinary research projects usually do not involve more than two dis- 
ciplines, for example linguistics and computer science in the WordNet project (Fell- 
baum and Vossen 2016) or psychology and artificial intelligence in cognitive archi- 
tectures like ACT-R (Anderson 2009) or SOAR (Laird 2012). In order to move the 
scientific understanding of concepts forward, we need a truly interdisciplinary per- 
spective on concepts, involving a mutual understanding of the different approaches 
from different disciplines, a lively exchange of ideas, and synergies arising from the 
combination of different research perspectives and methods. Thus, our volume will 
focus on selected recent issues, approaches and results that are not only central to 
the highly interdisciplinary field of concept research, but that are also particularly 
important to newly emergent paradigms and challenges. 

This volume focuses on three topics (i.e., three distinct points of view) that lie at the 
core of concept research: representation, learning, and application. In the following, 
we will first present research questions related to the three topics (Sect. 1), and then, 
we will provide an overview of the contributions (Sect. 2). 


1 Research Questions 


In order to structure an interdisciplinary discussion and exchange about concept 
research, we found it useful to put a focus on three essential questions that need to 
be answered: How can conceptual knowledge be represented (Sect. 1.1)? How are 
concepts acquired (Sect. 1.2)? How is conceptual knowledge applied in cognitive 
tasks (Sect. 1.3)? 


1.1 Representation: How Can We Formally Describe
and Model Concepts? 


One of the major challenges in concept research is to find a formal representation 
of concepts that is on the one hand able to explain a wide range of empirical obser- 
vations and experimental results and that can on the other hand be easily applied 
in practice. Exemplar and prototype theories from psychology focus on the crucial 
role of representative instances, whereas knowledge-based theories (Murphy and 
Medin 1985) emphasize that concepts do not occur in isolation, but always stand in 
relations to other concepts. Ontologies (Gómez-Pérez et al. 2004) from the semantic
web area provide a formal way of describing such networks of concepts. The logical- 
formal approaches from linguistics aim at accounts of the context-independent and 
context-dependent aspects of meaning and can be related to logic-based representa- 
tions in artificial intelligence (Russell and Norvig 2002). Finally, the feature spaces 
commonly used in the field of machine learning (Mitchell 1997) (for example in 
nearest-neighbor classifiers) can be linked to prototype and exemplar approaches 
from psychology. When analyzing formal representations of concepts, the following 
questions should be considered: 


• What are the underlying assumptions of different representation approaches? How
are they motivated and to what extent are they compatible with one another?

• How can different representation formalisms be compared with one another? What
are useful and meaningful criteria for making such a comparison?

• How transferable are the different approaches to other domains? For instance, are
there any benefits in using a prototype approach in linguistic analyses of natural
language semantics?

• How can different representational approaches be augmented or combined with
one another in order to arrive at a more holistic model?


1.2 Learning: Where Do Concepts Come from and How Are 
They Acquired? 


Another major issue in concept research is concerned with concept acquisition, which 
is not only important per se but also essential for evaluating whether a specific theory 
of human concepts is psychologically plausible (Carey 2015). While there are well- 
established assumptions about children’s acquisition of core concepts such as the 
basic-level bias and the taxonomic assumption, the exact nature of the underlying 
processes remains controversial. On a larger time scale, the evolution of concepts 
in human societies (Hull 1920) and similar processes in groups of robots (Spranger 
2012) can give insights into learning processes. Moreover, studying concept learning 
across languages and cultures enables a better understanding of universality and 
diversity in concepts (Imai et al. 2010). Furthermore, to successfully coin and transfer 
new concepts, it is crucial to understand differences between everyday concepts and 


4 L. Bechberger and M. Liu 


expert concepts, e.g., in mathematics (Rips et al. 2008). These and other related 
issues (e.g., innateness, groundedness and embodiment) require researchers to not 
only strive for advances in their own field (such as in terms of improved machine 
learning algorithms in artificial intelligence), but also to start in-depth exchanges 
with neighboring disciplines. The following questions can provide useful guidelines 
when approaching concept learning: 


• Which kinds of learning mechanisms (e.g., supervised vs. unsupervised, multimodal
vs. unimodal) are in the focus of the different research disciplines? Can
multiple learning mechanisms be combined with one another?

• Which representational assumptions are made by the different learning mechanisms?

• How does concept learning interact with the development of low-level and other
high-level cognitive (e.g., motorsensory, reasoning) abilities? What are their underlying
mechanisms?

• What are the differences between learning concrete (e.g., APPLE) and abstract
concepts (e.g., LOVE) and between learning expert and everyday concepts?


1.3 Application: How Are Concepts Used in Cognitive Tasks? 


The last decade has witnessed an exploding utilization of conceptual knowledge 
bases, unprecedented both in scale and range of applications. The conceptual core 
of the semantic web (Berners-Lee et al. 2001) and artificial agents like IBM’s Wat- 
son (Ferrucci et al. 2010) is largely based on AI technologies dating back to the 
last millennium (e.g., description logics). The new development has clearly shown 
the potential but also the limits of such approaches. The questions that arise here 
obviously link to other fields: The combination of a multitude of potential resources 
asks for modern AI methods to reason over heterogeneous and inconsistent data 
(Potyka and Thimm 2017). The application of conceptual knowledge in communi- 
cation, including conceptual combination and application of conceptual knowledge 
in context, are classical problems in linguistics. And the problem of generating new 
concepts may find answers in recent psychological theories on creativity (Schorlem- 
mer et al. 2014). The following questions are important with regard to the application 
of concepts: 


• Which aspects of applying concepts are analyzed in the different disciplines? Is
there any considerable overlap?

• How do the different representational formalisms constrain specific application
scenarios and vice versa?

• Which contextual effects occur in the application of concepts? How are these
effects handled in different frameworks?

• Which mechanisms exist for performing conceptual combination? Which constraints
apply? How are conflicts that emerge from conflicting conceptualizations
detected and resolved?


2 Summaries of the Contributed Chapters


This volume consists of seven individual chapters from different scientific disciplines,
each of which relates to at least one of the three topics presented in Sect. 1.
Figure 1 illustrates how the individual contributions relate to each other, based on
their underlying disciplines, common themes, and the three focus topics from Sect. 1.
It shows both the strong relations between the individual contributions and
the broad spectrum of this edited volume. We will now introduce the individual
contributions in more detail.

Bechberger and Kühnberger’s contribution “Generalizing Psychological Similarity
Spaces to Unseen Stimuli – Combining Multidimensional Scaling with Artificial
Neural Networks” (Chap. 2) addresses the focus topic of learning. It uses a
spatial model of concepts as regions in psychological similarity spaces based on 
Gärdenfors’ cognitive framework of conceptual spaces. These similarity spaces are 
typically obtained based on dissimilarity ratings from psychological studies and the 
technique of “multidimensional scaling” (MDS). This approach is however unable 
to generalize to unseen inputs. The authors propose to use MDS on human similarity 
ratings for initializing the similarity space and ANNs (artificial neural networks) to 
learn a mapping from raw stimuli into this similarity space. This proposal is a valuable
contribution for integrating psychology and artificial intelligence. In order to
validate their hybrid approach, the authors conducted a feasibility study. Their results
show that while their proposal works in principle, the generalization capabilities of
the ANNs are still limited and need to be improved further.

[Figure 1: diagram placing the contributed chapters (Chapters 2 to 8) within the disciplines “Artificial Intelligence and Logics”, “Linguistics”, and “Psychology”, with research themes including “Imprecision” and “Spatial Models”]

Fig. 1 Visualization of the contributed chapters based on scientific disciplines (solid ellipses),
common research themes (dashed rectangles), and classification based on the three focus topics
representation (REP), learning (LRN), and application (APP)

Färber, Svetashova, and Harth’s contribution “Theories of Meaning for the
Internet of Things” (Chap. 3) is concerned with the representation of concepts in
the context of the Internet of Things (IoT) from the perspective of artificial intelligence.
They compare different representational frameworks from philosophy and
computer science, taking a simple smart home setting as an application example. 
Overall, they consider four different approaches, namely model-theoretic semantics 
(which are based on first-order logic), possible world semantics (using modal logic), 
situation semantics, and cognitive and distributional semantics (i.e., spatial models 
of meaning). With the IoT application in mind, the authors assess whether these rep- 
resentational frameworks are able to represent intersubjectivity (i.e., multiple agents) 
and dynamics (i.e., changes in the state of the world) and to what extent they can be 
connected to perception. The authors conclude that none of the existing approaches 
is able to completely satisfy all three requirements. They propose to further investi- 
gate a combination between situational and distributional semantics as a promising 
avenue for future research. 

Gust and Umbach’s contribution “A Qualitative Similarity Framework for
the Interpretation of Natural Language Similarity Expressions” (Chap. 4) also explores
the representation of concepts, here in the context of natural language semantics. It aims at
the interpretation of expressions of similarity and sameness, such as so/similar/same 
in English or their counterparts in German. The authors argue that treating similarity
as a primitive predicate is unsatisfactory because semantic differences between individual
similarity expressions could not be accounted for and the role of similarity
expressions in creating ad-hoc kinds, for example, by similarity demonstratives and 
scalar and non-scalar equatives would be obscured. The framework proposed in the 
paper introduces a non-metric qualitative concept of similarity which makes use of a 
spatial model called attribute spaces equipped with systems of predicates correspond- 
ing to predicates on the domain. Individuals are mapped to points in attribute spaces 
by generalized measure functions. Two individuals count as similar if their images 
in a particular attribute space given a particular predicate system cannot be distin- 
guished. This allows representations of varying granularity and hence of different 
degrees of imprecision. The authors argue that the framework is suited for modeling
the meaning of natural language similarity expressions and, moreover, accounts
for their role in ad-hoc kind formation constructions. It thus provides a logic-based
formalism which is able to explain linguistic phenomena. 

Gega, Liu and Bechberger’s contribution “Numerical Concepts in Context” 
(Chap. 5) deals with the semantics and pragmatics of numerical expressions, with a
focus on their precise or imprecise interpretations. While the precise interpretation 
most prominently appears in mathematical contexts, the imprecise interpretation 
seems to arise when numbers (as quantities) are applied to real world contexts (e.g., 
the rope is 50 m long). Earlier literature shows that the (im)precise interpretation
can depend on different factors, e.g., the kind of approximators a numeral appears 
with (precise vs. imprecise, e.g., exactly vs. roughly) or the kind of the numeral 
itself (round vs. non-round, e.g., 50 vs. 47). The authors report on a corpus-linguistic
study and a psycholinguistic rating experiment of English numerical expressions.
The results confirm the effects of both factors and additionally reveal an effect of the
kind of unit, namely, whether it refers to discrete or continuous concepts (e.g.,
PEOPLE vs. METER).

Schneider and Nürnberger’s contribution “Evaluating Semantic CoCreation
by using a Marker as Linguistic Constraint in Cognitive Representation Models”
(Chap. 6) explores the application of conceptual knowledge in communication
between multiple agents. More specifically, they address semantic co-creation, i.e., 
the convergence of the cognitive models of the interlocutors within a conversation. 
The authors hypothesize that a shared marker can facilitate this coordination of repre- 
sentations. In order to validate this hypothesis, they conducted an experiment where 
groups of three participants needed to identify a target location on a given map. 
One participant (the describer) was given the target and had to describe it to the two 
other participants who needed to correctly identify this target location. One of them 
(the committer) was able to give feedback to the describer while the other one (the 
observer) had to remain passive. The authors considered four experimental condi- 
tions which differed in the availability of a shared marker (i.e., a movable point on the 
map) and in the complexity of the task (measured by the number of cities displayed 
on the map). Their results show that when task complexity was low, no real interac- 
tion between the participants was necessary to successfully solve the task. Contrary 
to their expectations, the shared marker was not able to improve performance in the 
high-complexity scenario. While their results highlight that a certain level of complexity
is necessary to elicit interactions, they also cast doubt on the assumption that
additional means of communication (such as a shared marker) necessarily improve
the outcome of the interaction. Their work thus calls for further research both in
psychology and linguistics to gain a deeper understanding of the observed effects.

Scerrati, Iani and Rubichi’s contribution “Does the Activation of Motor Information
Affect Semantic Processing?” (Chap. 7) considers the application of concepts
in lexical decision tasks, focusing on the influence of pre-activated motor informa- 
tion. The authors report on a psychological priming experiment in which the subjects 
were instructed to make keypress responses depending on two factors: One factor is 
word type with target words being relevant/irrelevant/unrelated to action (e.g., han- 
dle/ceramic/eyelash) with respect to a prime object (e.g., image of a frying pan). The 
other factor is spatial compatibility with the related part of the prime object (e.g., 
handle for a frying pan) either on the same side or on the opposite side of the key 
to be pressed. The dependent measures were reading time (RT) latencies and error 
rates for the question whether the target word was an Italian word. The results of 
the RT latencies did not show any significant effects or an interaction. The results of 
the error rates however showed a significant main effect of word type with the lex- 
ical decision responses being more accurate with action-relevant target words than 
with action-irrelevant words or unrelated words. This indicates that motor activation 
may indeed influence semantic processing, thus complementing and enriching the 
literature that focuses on the reverse effect of semantic content on motor activation. 

Vernillo’s contribution “Grounding Abstract Concepts in Action: The
Semantic Analysis of Four Italian Action Verbs Encoding Force Events” (Chap. 8)
also focuses on the application of conceptual knowledge, comparing the concrete and
metaphorical uses of the four Italian action verbs premere, spingere, tirare, and 
trascinare (i.e., ‘press’, ‘push’, ‘pull’, and ‘drag’). The underlying hypothesis is that 
the image schema of their literal meaning also constrains their usage in the metaphor- 
ical meaning. The linguistic study uses the representation of verb meanings through 
3D scenes from the IMAGACT database. Based on the extracted data, the author pro- 
vides a description of the semantic resemblances and differences in terms of salient 
image-schematic structures. The results show that the four verbs under considera- 
tion belong to the same semantic class of force (involving motor information and 
movement), and that they share commonalities in their literal and metaphorical use. 
At the same time, one can also observe systematic differences: For instance, while 
the literal meaning of premere focuses on the force exerted on the object, spingere 
emphasizes the resulting movement. These different connotations are also transferred 
to the metaphorical usage where spingere entails a change of state while premere 
does not. The results of this analysis support the view that metaphors are not just a 
linguistic phenomenon, but are grounded in embodied conceptual knowledge. 


1 Introduction


In this chapter, we propose a combination of psychologically derived similarity rat- 
ings with modern machine learning techniques in the context of cognitive artificial 
intelligence. More specifically, we extract a spatial representation of conceptual simi- 
larity from psychological data and learn a mapping from visual input onto this spatial 
representation. 

We base our work on the cognitive framework of conceptual spaces (Gärdenfors 
2000), which proposes a geometric representation of conceptual structures: Instances 
are represented as points and concepts are represented as regions in psychological 
similarity spaces. Based on this representation, one can explain a range of cognitive 
phenomena from one-shot learning to concept combination. Conceptual spaces can 
be interpreted as a spatial variant of the influential prototype theory of concepts 
(Rosch et al. 1976) by identifying the prototype of a given category with the centroid 
of the respective convex region. Moreover, conceptual spaces can be related to the 
feature spaces typically used in machine learning (Mitchell 1997), where individual 
observations are also represented as sets of feature values and where the task is to 
identify regions which correspond to pre-defined categories. 

As Gärdenfors (2018) has argued, the framework of conceptual spaces splits the 
overall problem of concept learning into two sub-problems: On the one hand, the 
space itself with its distance relation and its underlying dimensions needs to be 
learned. On the other hand, one needs to identify meaningful regions within this
similarity space. The latter problem can be easily solved by simple learning mech- 
anisms such as taking the centroid of a given set of category members (Gärdenfors 
2000). The problem of obtaining the similarity spaces themselves is however much 
harder. While in humans, the dimensions of these spaces may be partially innate 
or learned based on perceptual invariants (Gärdenfors 2018), it is difficult to mimic
such processes in artificial systems. 
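
The "simple learning mechanism" mentioned above can be illustrated with a minimal sketch: each category is summarized by the centroid of its known members, and a new point is assigned to the category with the nearest centroid. The function names, the toy coordinates, and the category labels are assumptions made for this illustration and are not taken from the chapter.

```python
import numpy as np

def learn_prototypes(points, labels):
    """Return a dict mapping each category label to the centroid of its members."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    return {
        label: points[labels == label].mean(axis=0)
        for label in np.unique(labels).tolist()
    }

def classify(point, prototypes):
    """Assign a point to the category whose prototype is closest (Euclidean distance)."""
    point = np.asarray(point, dtype=float)
    return min(prototypes, key=lambda label: np.linalg.norm(point - prototypes[label]))

# Hypothetical two-dimensional similarity space with two toy categories.
points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0], [5.0, 6.0]]
labels = ["apple", "apple", "apple", "pear", "pear", "pear"]

prototypes = learn_prototypes(points, labels)
print(classify([0.5, 0.5], prototypes))
```

Nearest-prototype assignment of this kind implicitly partitions the space into convex regions, which is one reason why it fits the conceptual spaces view of concepts as convex regions.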

When using conceptual spaces as a modeling tool, one can distinguish three ways 
of obtaining the underlying dimensions: If the domain of interest is well understood, 
one can manually define the dimensions and thus the overall similarity space. This can 
for instance be done for the domain of colors, for which a variety of similarity spaces 
exists. A second approach is based on machine learning algorithms for dimension- 
ality reduction. For instance, unsupervised artificial neural networks (ANNs) such 
as autoencoders or self-organizing maps can be used to find a compressed represen- 
tation for a given set of input stimuli. This task is typically solved by optimizing 
a mathematical error function which may not be satisfactory from a psychological
point of view. 

A third way of obtaining the dimensions of a conceptual space is based on dissim- 
ilarity ratings obtained from human subjects. One first elicits dissimilarity ratings for 
pairs of stimuli in a psychological study. The technique of “multidimensional scaling” 
(MDS) takes as an input these pair-wise dissimilarities as well as the desired number 
t of dimensions. It then represents each stimulus as a point in a t-dimensional space 
in such a way that the distances between points in this space reflect the dissimilari- 
ties of their corresponding stimuli. Nonmetric MDS assumes that the dissimilarities 
are only ordinally scaled and limits itself to representing the ordering of distances 
correctly. Metric MDS on the other hand assumes an interval or ratio scale and also 
tries to represent the numerical values of the dissimilarities as closely as possible. We 
introduce multidimensional scaling in more detail in Sect. 2. Moreover, we present
a study investigating the differences between similarity spaces produced by metric 
and nonmetric MDS in Sect. 3. 
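
To make the input and output of MDS concrete, the following sketch uses classical (Torgerson) scaling, a closed-form relative of metric MDS. It is a swapped-in illustration under stated assumptions and not necessarily the algorithm used in this chapter; the function name and the toy dissimilarity matrix are hypothetical.

```python
import numpy as np

def classical_mds(dissimilarities, t):
    """Embed stimuli as points in t dimensions via classical (Torgerson) scaling."""
    delta = np.asarray(dissimilarities, dtype=float)
    n = delta.shape[0]
    # Double-center the squared dissimilarities to obtain a Gram matrix.
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (delta ** 2) @ centering
    # Keep the t largest eigenvalues and their eigenvectors.
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:t]
    scale = np.sqrt(np.clip(eigenvalues[top], 0.0, None))
    return eigenvectors[:, top] * scale

# Toy dissimilarities of three stimuli that lie on a line at 0, 3, and 7.
delta = np.array([[0.0, 3.0, 7.0],
                  [3.0, 0.0, 4.0],
                  [7.0, 4.0, 0.0]])

coords = classical_mds(delta, t=1)
# Distances between the embedded points should reflect the dissimilarities.
reconstructed = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
print(np.round(reconstructed, 3))
```

In this toy case the dissimilarities are exactly Euclidean, so a one-dimensional space reproduces them without error; human ratings will generally not allow such a perfect fit.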

One limitation of the MDS approach is that it is unable to generalize to unseen 
inputs: If a new stimulus arrives, it is impossible to directly map it onto a point 
in the similarity space without eliciting dissimilarities to already known stimuli. In 
Sect. 4, we propose to use ANNs in order to learn a mapping from raw stimuli to sim- 
ilarity spaces obtained via MDS. This hybrid approach combines the psychological 
grounding of MDS with the generalization capability of ANNs. 

In order to support our proposal, we present the results of a first feasibility study 
in Sect. 5: Here, we use the activations of a pre-trained convolutional network as
features for a simple regression into the similarity spaces from Sect. 3. 
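
The idea of regressing from network features into a similarity space can be sketched as follows. This is only an illustration under strong assumptions: the feature vectors are random stand-ins for network activations, the target coordinates are synthetic rather than MDS output, and ordinary least squares is used as one possible form of "simple regression". None of the names or numbers come from the study itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (assumptions): 50 stimuli, 16 features, 2 target dimensions.
n_stimuli, n_features, n_dims = 50, 16, 2
features = rng.normal(size=(n_stimuli, n_features))        # placeholder for activations
hidden_map = rng.normal(size=(n_features, n_dims))         # unknown in practice
target_coords = features @ hidden_map + 0.01 * rng.normal(size=(n_stimuli, n_dims))

def add_bias(x):
    """Append a constant column so that the regression can learn an offset."""
    return np.hstack([x, np.ones((x.shape[0], 1))])

def fit_linear_map(x, y):
    """Least-squares fit of a linear mapping from features to coordinates."""
    weights, *_ = np.linalg.lstsq(add_bias(x), y, rcond=None)
    return weights

def predict(x, weights):
    """Map feature vectors to points in the similarity space."""
    return add_bias(x) @ weights

# Hold out the last 10 stimuli to mimic "unseen" inputs.
train, test = slice(0, 40), slice(40, 50)
weights = fit_linear_map(features[train], target_coords[train])
predicted = predict(features[test], weights)
mse = float(np.mean((predicted - target_coords[test]) ** 2))
print(predicted.shape, mse)
```

Evaluating on held-out stimuli, as in the last lines, is what distinguishes generalization to unseen inputs from merely fitting the stimuli that were used to construct the space.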

Finally, Sect. 6 summarizes the results obtained in this paper and gives an outlook 
on future work. Code for reproducing both of our studies can be found online at 
https://github.com/lbechberger/LearningPsychologicalSpaces/ (Bechberger 2020). 

Our overall contribution can be seen as providing artificial systems with a way to 
map raw perceptions onto psychological similarity spaces. These similarity spaces 
can then be used in order to learn conceptual regions and to reason with them. Our 
research has strong relations to two other chapters in this edited volume. 

The conceptual spaces framework itself can be considered as a specific instance
of the approach labeled as “cognitive and distributional semantics” in the contribution
by Färber, Svetashova, and Harth (Chap. 3). Our hybrid proposal from Sect. 4
exemplifies the procedure of obtaining such a cognitive representation which is both
psychologically grounded and applicable to novel stimuli. Especially the latter property
of our hybrid proposal is crucial for applications in technical systems such as
the Internet of Things (IoT) considered by Färber, Svetashova, and Harth.

The attribute spaces used by Gust and Umbach (Chap. 4) are also closely related
to the similarity spaces considered in our contribution. While our work focuses on 
grounding such a similarity space in perception, Gust and Umbach analyze how natu- 
ral language similarity expressions can be linked to spatial models. The contribution 
by Gust and Umbach can thus be seen as a complement to our work, considering a 
higher level of abstraction. 


2 Multidimensional Scaling 


In this section, we provide a brief introduction to multidimensional scaling. We first 
give an overview of the elicitation methods for similarity ratings in Sect. 2.1, before 
explaining the basics of MDS algorithms in Sect. 2.2. The interested reader is referred 
to Borg and Groenen (2005) for a more detailed introduction to MDS. 


2.1 Obtaining Dissimilarity Ratings 


In order to collect similarity ratings from human participants, several different tech- 
niques can be used (Goldstone 1994; Hout et al. 2013; Wickelmaier 2003). They are 
typically grouped into direct and indirect methods: In direct methods, participants 
are fully aware that they rate, sort, or classify different stimuli according to their 
pairwise dissimilarities. Indirect methods on the other hand are based on secondary 
empirical measurements such as confusion probabilities or reaction times. 

One of the classical direct techniques is based on explicit ratings for pairwise 
comparisons. In this approach, all possible pairs from a set of stimuli are presented 
to participants (one pair at a time), and participants rate the dissimilarity of each pair 
on a continuous or categorical scale. Another direct technique is based on sorting 
tasks. For instance, participants might be asked to group a given set of stimuli into 
piles of similar items. In this case, similarity is binary—either two items are sorted 
into the same pile or not. 

Perceptual confusion tasks can be used as an indirect technique for obtaining 
similarity ratings. For example, participants can be asked to report as fast as possible 
whether two displayed items are the same or different. In this case, confusion prob- 
abilities and reaction times are measured in order to infer the underlying similarity 
relation. 

Goldstone (1994) has argued that the classical approaches for collecting similarity
data are limited in various ways. Their biggest shortcoming is that explicitly testing
all $\frac{N \cdot (N-1)}{2}$ stimulus pairs is quite time-consuming. An increasing number of stimuli
therefore leads to very long experimental sessions which might cause fatigue effects. 
Moreover, in the course of such long sessions, participants might switch to a different 
rating strategy after some time, making the collected data less homogeneous. 
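
A short calculation shows how quickly the number of required ratings grows. The sketch assumes that every unordered pair of distinct stimuli is rated exactly once, i.e., N(N-1)/2 ratings for N stimuli; the function name is chosen for illustration only.

```python
def number_of_pairs(n_stimuli):
    """Number of unordered pairs of distinct stimuli: N * (N - 1) / 2."""
    return n_stimuli * (n_stimuli - 1) // 2

for n in (10, 30, 60):
    print(n, number_of_pairs(n))
# 10 stimuli -> 45 pairs, 30 stimuli -> 435 pairs, 60 stimuli -> 1770 pairs
```

Doubling the number of stimuli thus roughly quadruples the number of pairwise ratings.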

In order to make the data collection process more time-efficient, Goldstone (1994) 
has proposed the “Spatial Arrangement Method” (SpAM). In this collection tech- 
nique, multiple visual stimuli are simultaneously displayed on a computer screen. In 
the beginning, the arrangement of these stimuli is randomized. Participants are then 
asked to arrange them via drag and drop in such a way that the distances between 
the stimuli are proportional to their dissimilarities. Once participants are satisfied 
with their solution, they can store the arrangement. The dissimilarity of two stimuli 
is then recorded as their Euclidean distance in pixels. As N items can be displayed 
at once, each single modification by the user updates N distance values at the same 
time which makes this procedure quite efficient. Moreover, SpAM quite naturally 
incorporates geometric constraints: If A and B are placed close together and C is 
placed far away from A, then it cannot be very close to B. 
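
The recording step of SpAM can be sketched as follows: given the final screen positions of the stimuli, the dissimilarity matrix is simply the matrix of pairwise Euclidean distances in pixels. The pixel coordinates and the function name are hypothetical and serve only as an illustration.

```python
import numpy as np

def spam_dissimilarities(positions):
    """Pairwise Euclidean distances (in pixels) between arranged stimuli."""
    positions = np.asarray(positions, dtype=float)
    differences = positions[:, None, :] - positions[None, :, :]
    return np.sqrt((differences ** 2).sum(axis=-1))

# Hypothetical final arrangement of three stimuli A, B, C (x, y in pixels).
positions = [[100, 100], [130, 140], [400, 500]]
dissimilarities = spam_dissimilarities(positions)
print(dissimilarities)
```

In this arrangement A and B are 50 pixels apart and C is 500 pixels away from A, so by the triangle inequality C must be at least 450 pixels away from B, which mirrors the geometric constraint described above.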

As the dissimilarity information is recorded in the form of Euclidean distances, 
one might assume that the dissimilarity ratings obtained through SpAM are ratio 
scaled. This view is for instance held by Hout et al. (2014). However, as participants 
are likely to make only a rough arrangement of the stimuli, this assumption might 
be too strong in practice. One can argue that it is therefore safer to only assume an 
ordinal scale. As far as we know, there have been no explicit investigations on this 
issue. We will provide an analysis of this topic in Sect. 3. 


2.2 The Algorithms 


In this chapter, we follow the mathematical notation by Kruskal (1964a), who gave 
the first thorough mathematical treatment of (nonmetric) multidimensional scaling. 
One can typically distinguish two types of MDS algorithms (Wickelmaier 2003), 
namely metric and nonmetric MDS. Metric MDS assumes that the dissimilarities are 
interval or ratio scaled, while nonmetric MDS only assumes an ordinal scale. 

Both variants of MDS can be formulated as an optimization problem involving
the pairwise dissimilarities $\delta_{ij}$ between stimuli and the Euclidean distances $d_{ij}$ of
their corresponding points in the $t$-dimensional similarity space. More specifically,
MDS involves minimizing the so-called “stress”, which measures to which extent the
spatial representation violates the information from the dissimilarity matrix:


$$\text{stress} = \sqrt{\frac{\sum_{i<j} (d_{ij} - \hat{d}_{ij})^2}{\sum_{i<j} d_{ij}^2}}$$


The denominator in this equation serves as a normalization factor in order to make
stress invariant to the scale of the similarity space.

In metric MDS, we use $\hat{d}_{ij} = a \cdot \delta_{ij} + b$ to compute stress. This means that we
look for a configuration of points in the similarity space whose distances are a linear
transformation of the dissimilarities.

In nonmetric MDS, on the other hand, the $\hat{d}_{ij}$ are not obtained by a linear but by a
monotone transformation of the dissimilarities: Let us order the dissimilarities of the
stimuli ascendingly: $\delta_{i_1 j_1} < \delta_{i_2 j_2} < \delta_{i_3 j_3} < \dots$. The $\hat{d}_{ij}$ are then obtained by defining
an analogous ascending order, where the difference between the disparities $\hat{d}_{ij}$ and
the distances $d_{ij}$ is as small as possible: $\hat{d}_{i_1 j_1} < \hat{d}_{i_2 j_2} < \hat{d}_{i_3 j_3} < \dots$. Nonmetric MDS
therefore only tries to reflect the ordering of the dissimilarities in the distances while
metric MDS also tries to take into account their differences and ratios.
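
The difference between the two variants can be illustrated with a small Python sketch. It assumes that the metric disparities come from a least-squares line fit and that the nonmetric disparities come from isotonic regression (pool-adjacent-violators), which yields a weakly monotone order; these choices and all function names are our assumptions, not the smacof implementation.

```python
import math

def linear_disparities(delta, dist):
    """Metric case: least-squares fit dist ~ a * delta + b, returns a * delta + b."""
    n = len(delta)
    mean_x, mean_y = sum(delta) / n, sum(dist) / n
    sxx = sum((x - mean_x) ** 2 for x in delta)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(delta, dist))
    a = sxy / sxx
    b = mean_y - a * mean_x
    return [a * x + b for x in delta]

def monotone_disparities(delta, dist):
    """Nonmetric case: isotonic regression of dist along the order of delta."""
    order = sorted(range(len(delta)), key=lambda i: delta[i])
    blocks = []  # each block stores [sum of values, number of values]
    for i in order:
        blocks.append([dist[i], 1])
        # Merge neighbouring blocks while their means violate the ordering.
        while (len(blocks) > 1 and
               blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    result = [0.0] * len(delta)
    for rank, i in enumerate(order):
        result[i] = fitted[rank]
    return result

def stress(dist, disparities):
    """Normalized stress: sqrt(sum (d - d_hat)^2 / sum d^2)."""
    numerator = sum((d - h) ** 2 for d, h in zip(dist, disparities))
    denominator = sum(d ** 2 for d in dist)
    return math.sqrt(numerator / denominator)

# Made-up example: distances are a monotone but nonlinear function of delta.
delta = [1.0, 2.0, 3.0, 4.0]
dist = [1.0, 4.0, 9.0, 16.0]
metric_stress = stress(dist, linear_disparities(delta, dist))
nonmetric_stress = stress(dist, monotone_disparities(delta, dist))
```

In this toy example the ordering is perfectly preserved, so nonmetric stress is zero, while metric stress is positive because the relationship is not linear.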

There are different approaches towards optimizing the stress function, resulting 
in different MDS algorithms. Kruskal’s original nonmetric MDS algorithm (Kruskal 
1964b) is based on gradient descent: In an iterative procedure, the derivative of the 
stress function with respect to the coordinates of the individual points is computed 
and then used to make a small adjustment to these coordinates. Once the derivative 
approaches zero, a minimum of the stress function has been found. 
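
The basic idea of such a gradient descent can be sketched as follows. This is a strongly simplified illustration and not Kruskal's algorithm: it minimizes the unnormalized squared error between distances and dissimilarities, keeps the disparities fixed to the dissimilarities, and uses a constant step size. The function names, the step size, and the example data are assumptions.

```python
import math
import random

def raw_stress(points, delta):
    """Sum of squared differences between distances and dissimilarities."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def gradient_step(points, delta, step=0.05):
    """Move every point a small step against the gradient of raw_stress."""
    n, t = len(points), len(points[0])
    grads = [[0.0] * t for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(points[i], points[j])
            if dist == 0.0:
                continue  # the distance is not differentiable for coinciding points
            factor = 2.0 * (dist - delta[i][j]) / dist
            for k in range(t):
                grads[i][k] += factor * (points[i][k] - points[j][k])
    return [[p[k] - step * g[k] for k in range(t)]
            for p, g in zip(points, grads)]

# Made-up dissimilarities that can be realized exactly by a 3-4-5 triangle.
random.seed(0)
delta = [[0, 3, 4], [3, 0, 5], [4, 5, 0]]
points = [[random.random(), random.random()] for _ in range(3)]
initial = raw_stress(points, delta)
for _ in range(200):
    points = gradient_step(points, delta)
final = raw_stress(points, delta)
```

The explicit check for coinciding points already hints at the differentiability problem of this approach.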

A more recent MDS algorithm by de Leeuw (1977) is called SMACOF (an
acronym of “Scaling by Majorizing a Complicated Function”). De Leeuw pointed
out that Kruskal’s gradient descent method has two major shortcomings: Firstly, if
the points for two stimuli coincide (i.e., $x_i = x_j$), then the distance function of these
two points is not differentiable. Secondly, Kruskal was not able to give a proof of
convergence for his algorithm. In order to overcome these limitations, de Leeuw showed
that minimizing the stress function is equivalent to maximizing another function $\lambda$
which depends on the distances and dissimilarities. This function can be easily
maximized by using iterative function majorization. Moreover, one can prove that this
iterative procedure converges. SMACOF is computationally efficient and guarantees
a monotone convergence of stress (Borg and Groenen 2005, Chap. 8).
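
The monotone behavior can be illustrated with the update rule commonly presented for SMACOF with unit weights, the so-called Guttman transform. The sketch below assumes fixed dissimilarities (no metric or nonmetric transformation step) and unnormalized stress; it is not the code of the smacof package, and the example data are made up.

```python
import math

def raw_stress(points, delta):
    """Sum of squared differences between distances and dissimilarities."""
    n = len(points)
    return sum((math.dist(points[i], points[j]) - delta[i][j]) ** 2
               for i in range(n) for j in range(i + 1, n))

def guttman_transform(points, delta):
    """One majorization step: x_i <- (1/n) * sum_j (delta_ij / d_ij) * (x_i - x_j)."""
    n, t = len(points), len(points[0])
    new_points = []
    for i in range(n):
        row = [0.0] * t
        for j in range(n):
            if i == j:
                continue
            dist = math.dist(points[i], points[j])
            ratio = delta[i][j] / dist if dist > 0.0 else 0.0  # coinciding points are harmless here
            for k in range(t):
                row[k] += ratio * (points[i][k] - points[j][k])
        new_points.append([value / n for value in row])
    return new_points

# Made-up dissimilarities: the side lengths and diagonals of a 2 x 1 rectangle.
s5 = 5 ** 0.5
delta = [[0.0, 2.0, s5, 1.0],
         [2.0, 0.0, 1.0, s5],
         [s5, 1.0, 0.0, 2.0],
         [1.0, s5, 2.0, 0.0]]
points = [[0.0, 0.0], [1.0, 0.2], [0.3, 1.0], [0.9, 0.8]]
history = [raw_stress(points, delta)]
for _ in range(50):
    points = guttman_transform(points, delta)
    history.append(raw_stress(points, delta))
```

Unlike the gradient-based update, this step needs neither a step size nor a derivative of the distance function, and the recorded stress values never increase.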

Picking the right number of dimensions $t$ for the similarity space is not trivial.
Kruskal (1964a) proposes two approaches to address this problem. 

On the one hand, one can create a so-called “Scree” plot that shows the final stress
value for different values of $t$. If one can identify an “elbow” in this diagram (i.e.,
a point after which the stress decreases much more slowly than before), this can point
towards a useful value of $t$.
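
One simple way to make the elbow criterion operational is to look for the value of $t$ at which the decrease in stress drops most sharply. The following sketch is only a heuristic of our own; the function name and the stress values are hypothetical and not taken from our study.

```python
def elbow_dimension(stress_by_dim):
    """Return the dimensionality with the largest drop in the rate of improvement.

    stress_by_dim maps a number of dimensions t to the final stress value and
    is assumed to contain at least three consecutive values of t.
    """
    dims = sorted(stress_by_dim)
    best_t, best_bend = None, float("-inf")
    for prev, cur, nxt in zip(dims, dims[1:], dims[2:]):
        drop_before = stress_by_dim[prev] - stress_by_dim[cur]
        drop_after = stress_by_dim[cur] - stress_by_dim[nxt]
        bend = drop_before - drop_after
        if bend > best_bend:
            best_t, best_bend = cur, bend
    return best_t

# Hypothetical stress values for one to five dimensions.
example_stress = {1: 0.40, 2: 0.20, 3: 0.17, 4: 0.15, 5: 0.14}
suggested_t = elbow_dimension(example_stress)
```

Such a heuristic can support, but not replace, a visual inspection of the Scree plot.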

On the other hand, one can take a look at the interpretability of the generated
configurations. If the optimal configuration in a $t$-dimensional space has a sufficient
degree of interpretability and if the optimal configuration in a $(t+1)$-dimensional
space does not add more structure, then a $t$-dimensional space might be sufficient.

Fig. 1 Eight example stimuli from the NOUN data set (Horst and Hout 2016)


3 Extracting Similarity Spaces from the NOUN Data Set 


It is debatable whether metric or nonmetric MDS should be used with data collected 
through SpAM. Nonmetric MDS makes fewer assumptions about the underlying
measurement scale and therefore seems to be the “safer” choice. If the dissimilarities
are however ratio scaled, then metric MDS might be able to harness these pieces 
of information from the distance matrix as additional constraints. This might then 
result in a semantic space of higher quality. 

In our study, we compare metric to nonmetric MDS on a data set obtained through 
SpAM. If the dissimilarities obtained through SpAM are not ratio scaled, then the 
main assumption of metric MDS is violated. We would then expect that nonmetric 
MDS yields better solutions than metric MDS. If the dissimilarities obtained through 
SpAM are however ratio scaled and if the differences and ratios of dissimilarities do 
contain considerable amounts of additional information, then metric MDS should 
have a clear advantage over nonmetric MDS. 

For our study, we used existing dissimilarity ratings reported for the Novel Object 
and Unusual Name (NOUN) data set (Horst and Hout 2016), a set of 64 images of 
three-dimensional objects that are designed to be novel but also look naturalistic. 
Figure 1 shows some example stimuli from this data set.


3.1 Evaluation Metrics 


We used the stress0 function from R’s smacof package to compute both metric 
and nonmetric stress. We expect stress to decrease as the number of dimensions 
increases. If the data obtained through SpAM is ratio scaled, then we would expect 
that metric MDS achieves better values on metric stress (and potentially on nonmetric 
stress as well) than nonmetric MDS. If the SpAM dissimilarities are not ratio scaled, 
then metric MDS should not have any advantage over nonmetric MDS. 

Another possible way of judging the quality of an MDS solution is to look for 
interpretable directions in the resulting space. However, Horst and Hout (2016) have 
argued that for the novel stimuli in their data set there are no obvious directions that 
one would expect. Without a list of candidate directions, an efficient and objective 
evaluation based on interpretable directions is however hard to achieve. We therefore 
did not pursue this way of evaluating similarity spaces. 

As an additional way of evaluation, we measured the correlation between the
distances in the MDS space and the dissimilarity scores from the psychological 
study. 

Pearson’s r (Pearson 1895) measures the linear correlation of two random variables
by dividing their covariance by the product of their individual standard deviations. Given
two vectors $x$ and $y$ (each containing $N$ samples from the random variables $X$ and $Y$,
respectively), Pearson’s r can be estimated as follows, where $\bar{x}$ and $\bar{y}$ are the average
values of the two vectors:


$$r_{xy} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N} (x_i - \bar{x})^2} \sqrt{\sum_{i=1}^{N} (y_i - \bar{y})^2}}$$


Spearman’s ρ (Spearman 1904) generalizes Pearson’s r by also allowing for
nonlinear monotone relationships between the two variables. It can be computed by
replacing each observation $x_i$ and $y_i$ with its corresponding rank, i.e., its index in
a sorted list, and by then computing Pearson’s r on these ranks. By replacing the
actual values with their ranks, the numeric distances between the sample values lose
their importance; only the correct ordering of the samples remains important. Like
Pearson’s r, Spearman’s ρ is confined to the interval $[-1, 1]$ with positive values
indicating a monotonically increasing relationship.
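
Both coefficients can be computed with a few lines of Python. The sketch below is an illustration rather than the code used in our study; in particular, giving tied values the average of their ranks is an assumption that the text above does not specify.

```python
import math

def pearson_r(x, y):
    """Covariance divided by the product of the standard deviations."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (dev_x * dev_y)

def ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        average_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = average_rank
        pos = end + 1
    return result

def spearman_rho(x, y):
    """Pearson's r computed on the ranks of the observations."""
    return pearson_r(ranks(x), ranks(y))

# Made-up example: y grows monotonically, but not linearly, with x.
x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]
r_value = pearson_r(x, y)
rho_value = spearman_rho(x, y)
```

For this monotone but nonlinear relationship, Spearman’s ρ reaches its maximum while Pearson’s r stays below it.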

Both MDS variants can be expected to find a configuration such that there is a 
monotone relationship between the distances in the similarity space and the original 
dissimilarity matrix. That is, smaller dissimilarities correspond to smaller distances 
and larger dissimilarities correspond to larger distances. For Spearman’s ρ, we therefore
do not expect any notable differences between metric and nonmetric MDS. For
metric MDS, we also expect a linear relationship between dissimilarities and dis- 
tances. Therefore, if the dissimilarities obtained by SpAM are ratio scaled, then 
metric MDS should give better results with respect to Pearson’s r than nonmetric 
MDS. 

A final way for evaluating the similarity spaces obtained by MDS is visual inspec- 
tion: If a visualization of a given similarity space shows meaningful structures and 
clusters, this indicates a high quality of the semantic space. We limit our visual 
inspection to two-dimensional spaces. 


3.2 Methods 


In order to investigate the differences between metric and nonmetric MDS in the
context of SpAM, we used the SMACOF algorithm in its original implementation
in R’s smacof library.¹ SMACOF can be used in both a metric and a nonmetric
variant. The underlying algorithm stays the same, only the definition of stress and
thus the optimization target differs. Both variants were explored in our study. We
used 256 random starts with the maximum number of iterations per random start set
to 1000. The overall best result over these 256 random starts was kept as final result.

¹ See https://cran.r-project.org/web/packages/smacof/smacof.pdf.

For each of the two MDS variants, we constructed MDS spaces of different dimen- 
sionality (ranging from one to ten dimensions). For each of these resulting similarity 
spaces, we computed both its metric and its nonmetric stress. 

In order to analyze how much information about the dissimilarities can be readily 
extracted from the images of the stimuli, we also introduced two baselines. 

For our first baseline, we used the similarity of downscaled images: For each
original image (with both a width and height of 300 pixels), we created lower-resolution
variants by aggregating all the pixels in a $k \times k$ block into a single pixel
(with $k \in [2, 300]$). We compared different aggregation functions, namely, minimum,
mean, median, and maximum. The pixels of the resulting downscaled image were
then interpreted as a point in a $\lceil \frac{300}{k} \rceil \times \lceil \frac{300}{k} \rceil \times 3$ dimensional space.
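
The block-wise aggregation can be sketched as follows. The function name `downscale`, the nested-list image format, and the handling of incomplete blocks at the image border (they are simply aggregated over fewer pixels) are assumptions made for illustration.

```python
from statistics import mean, median

def downscale(image, k, aggregate=min):
    """Aggregate each k x k block per color channel into one feature value.

    image is a list of rows, each row a list of pixels, each pixel a tuple of
    channel values. Blocks at the right and bottom border may be smaller.
    """
    height, width, channels = len(image), len(image[0]), len(image[0][0])
    features = []
    for top in range(0, height, k):
        for left in range(0, width, k):
            block = [image[r][c]
                     for r in range(top, min(top + k, height))
                     for c in range(left, min(left + k, width))]
            for ch in range(channels):
                features.append(aggregate([pixel[ch] for pixel in block]))
    return features

# Synthetic 300 x 300 RGB image, used only to check the resulting dimensionality.
image = [[(r % 256, c % 256, 0) for c in range(300)] for r in range(300)]
features_k24 = downscale(image, 24, aggregate=min)
features_k12 = downscale(image, 12, aggregate=mean)
```

Any of the four aggregation functions (`min`, `mean`, `median`, `max`) can be passed as the `aggregate` argument.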

For our second baseline, we extracted the activation vectors from the second- 
to-last layer of the pre-trained Inception-v3 network (Szegedy et al. 2016) for each 
of the images from the NOUN data set. Each stimulus was thus represented by its 
corresponding activation pattern. While the downscaled images represent surface-level
information, the activation patterns of the neural network can be seen as a more
abstract representation of the image.

For each of the three representation variants (downscaled images, ANN activations,
and points in an MDS-based similarity space), we computed three types of
distances between all pairs of stimuli: The Euclidean distance $d_E$, the Manhattan
distance $d_M$, and the negated inner product $d_{IP}$. We only report results for the best
choice of the distance function. For each distance function, we used two variants: One
where all dimensions are weighted equally and another one where optimal weights
for the individual dimensions were estimated based on a non-negative least squares
regression in a five-fold cross validation (cf. Peterson et al. 2018, who followed a
similar procedure). For each of the resulting distance matrices, we computed the two
correlation coefficients with respect to the target dissimilarity ratings. We considered
only matrix entries above the diagonal because the matrices are symmetric and all
entries on the diagonal are guaranteed to be zero. Our overall workflow is illustrated
in Fig. 2.
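
The three distance functions and the restriction to entries above the diagonal can be sketched as follows. How exactly the dimension weights enter each formula is our assumption for this illustration, and the estimation of the weights via non-negative least squares is not shown.

```python
import math

def euclidean(u, v, w):
    """Weighted Euclidean distance."""
    return math.sqrt(sum(wk * (a - b) ** 2 for wk, a, b in zip(w, u, v)))

def manhattan(u, v, w):
    """Weighted Manhattan distance."""
    return sum(wk * abs(a - b) for wk, a, b in zip(w, u, v))

def negated_inner_product(u, v, w):
    """Negated weighted inner product: larger similarity gives a smaller value."""
    return -sum(wk * a * b for wk, a, b in zip(w, u, v))

def pairwise_upper_triangle(vectors, distance, weights=None):
    """Distances for all pairs (i, j) with i < j; uniform weights by default."""
    if weights is None:
        weights = [1.0] * len(vectors[0])
    n = len(vectors)
    return [distance(vectors[i], vectors[j], weights)
            for i in range(n) for j in range(i + 1, n)]
```

For the 64 stimuli of the NOUN data set, `pairwise_upper_triangle` returns 64 · 63 / 2 = 2016 values per distance function, which can then be correlated with the corresponding dissimilarity ratings.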


3.3 Results 


Figure 3a shows the Scree plots of the two MDS variants for both metric and non- 
metric stress. As one would expect, stress decreases with an increasing number 
of dimensions: More dimensions help to represent the dissimilarity ratings more 
accurately. Metric and nonmetric SMACOF yield almost identical performance with 
respect to both metric and nonmetric stress. This suggests that interpreting the SpAM 
dissimilarity ratings as ratio scaled is neither helpful nor harmful. 

Fig. 2 Illustration of our analysis setup. We measure the correlation between the dissimilarity
ratings and distances from three different sources, namely the pixels of downscaled images (left), 
activations of an artificial neural network (middle), and similarity spaces obtained by MDS (right) 


Figure 3b shows some line diagrams illustrating the results of the correlation 
analysis for the MDS-based similarity spaces. For both the pixel baseline and the 
ANN baseline, the usage of optimized weights considerably improved performance. 
As we can see, both of these baselines yield considerably higher correlations than 
one would expect for randomly generated configurations of points. Moreover, the 
ANN baseline outperforms the pixel baseline with respect to both evaluation metrics, 
indicating that raw pixel information is less useful in our scenario than the more 
high-level features extracted by the ANN. For the pixel baseline, we observed that 
the minimum aggregator yielded the best results. 

We also observe in Fig. 3b that the MDS solutions provide us with a better reflec- 
tion of the dissimilarity ratings than both pixel-based and ANN-based distances if 
the similarity space has at least two dimensions. This is not surprising since the MDS 
solutions are directly based on the dissimilarity ratings, whereas both baselines do 
not have access to the dissimilarity information. It therefore seems like our naive 
image-based ways of defining dissimilarities are not sufficient. 

With respect to the different MDS variants, the correlation analysis also confirms
our observations from the Scree plots: Metric and nonmetric SMACOF are almost
indistinguishable, with nonmetric SMACOF yielding slightly higher correlation values.
This supports the view that the assumption of ratio scaled dissimilarity ratings
is not beneficial, but also not very harmful on our data set. Moreover, we find the
tendency of improved performance with an increasing number of dimensions. This
again illustrates that MDS is able to fit more information into the space if this space
has a larger dimensionality.

Fig. 3 a Scree plots for both metric and nonmetric stress. b Correlation evaluation for the different
MDS solutions and the two baselines

Finally, let us look at the two-dimensional spaces generated by the two MDS 
variants in order to get an intuitive feeling for their semantic structure. Figure 4 
shows these spaces along with the local neighborhood of three selected items. These 
neighborhoods illustrate that in both spaces stimuli are grouped in a meaningful way. 
From our visual inspection, it seems that both MDS variants result in comparable 
semantic spaces with a similar structure. 

Overall, we did not find any systematic difference between metric and nonmetric
MDS on the given data set. It thus seems that the metric assumption is neither
beneficial nor harmful when trying to extract a similarity space. On the one hand, we
cannot conclude that the dissimilarities obtained through SpAM are not ratio scaled.
On the other hand, the additional information conveyed by differences and ratios of
dissimilarities does not seem to improve the overall results. We therefore advocate
the usage of nonmetric MDS due to the smaller number of assumptions made about
the dissimilarity ratings.

Fig. 4 Illustration of the two-dimensional spaces obtained by metric SMACOF (left) and nonmetric
SMACOF (right)


4 A Hybrid Approach 


Multidimensional scaling (MDS) is directly based on human similarity ratings and
therefore leads to conceptual spaces which can be considered psychologically valid.
The prohibitively large effort required to elicit such similarity ratings on a large 
scale however confines this approach to a small set of fixed stimuli. In Sect. 4.1, 
we propose to use machine learning methods in order to generalize the similarity 
spaces obtained by MDS to unseen stimuli. More specifically, we propose to use 
MDS on human similarity ratings to “initialize” the similarity space and artificial 
neural networks (ANNs) to learn a mapping from stimuli into this similarity space. 
We afterwards relate our proposal to two other recent studies in this area in Sect. 4.2. 


4.1 Our Proposal 


In order to obtain a solution having both the psychological validity of MDS spaces 
and the possibility to generalize to unseen inputs as typically observed for neural 
networks, we propose the following hybrid approach, which is illustrated in Fig. 5. 

Fig. 5 Illustration of the proposed hybrid procedure: a subset of data is used to construct a con-
ceptual space via MDS. A neural network is then trained to map images into this similarity space, 
aided by a secondary task (e.g., classification) 


After having determined the domain of interest (e.g., the domain of animals), one 
first needs to acquire a data set of stimuli from this domain. This data set should 
cover a wide variety of stimuli and it should be large enough for applying machine 
learning algorithms. Using the whole data set with potentially thousands of stimuli in 
a psychological experiment is however unfeasible in practice. Therefore, a relatively 
small, but still sufficiently representative subset of these stimuli needs to be selected 
for the elicitation of human dissimilarity ratings. This subset of stimuli is then used in 
a psychological experiment where dissimilarity judgments by humans are obtained, 
using one of the techniques described in Sect. 2.1. 

In the next step, one can apply MDS to these dissimilarity ratings in order to 
extract a spatial representation of the underlying domain. As stated in Sect. 2.2, one 
needs to manually select the desired number of dimensions—either based on prior 
knowledge or by manually optimizing the trade-off between high representational 
accuracy and a low number of dimensions. The resulting similarity space should 
ideally be analyzed for meaningful structures and a high correlation of inter-point 
distances to the original dissimilarity ratings. 

Once this mapping from stimuli (e.g., images of animals) to points in a similarity 
space has been established, we can use it in order to derive a ground truth for a 
machine learning problem: We can simply treat the stimulus-point mappings as 
labeled training instances where the stimulus is identified with the input vector and 
the point in the similarity space is used as its label. We can therefore set up a regression 
task from the stimulus space to the similarity space. 

Artificial neural networks (ANNs) have been shown to be powerful regressors
that are capable of discovering highly non-linear relationships between raw low-level
stimuli (such as images) and desired output variables. They are therefore a
natural choice for this task. ANNs are however a very data-hungry machine learning
method: They need large amounts of training examples and many training iterations
in order to achieve good performance. On the other hand, the available number of
stimulus-point pairs in our proposed procedure is quite low for a machine learning
problem. As argued before, we can only look at a small number of stimuli in a
psychological experiment.

We propose to resolve this dilemma not only through data augmentation, but also
by introducing an additional training objective (e.g., correctly classifying the given 
images into their respective classes such as CAT and DOG). This additional training 
objective can also be optimized on all the remaining stimuli from the data set that 
have not been used in the psychological experiment. Using a secondary task with 
additional training data constrains the network’s weights and can be seen as a form 
of regularization: These additional constraints are expected to counteract overfitting 
tendencies, i.e., tendencies to memorize all given mapping examples without being 
able to generalize. 

Figure 5 illustrates the secondary task of predicting the correct classes. This 
approach is only applicable if the data set contains class labels. If the network is 
forced to learn a classification task, then it will likely develop an internal repre- 
sentation where all members of the same class are represented in a similar way. 
The network then “only” needs to learn a mapping from this internal representation 
(which presumably already encodes at least some aspects of a similarity relation 
between stimuli) into the target similarity space. 

Another secondary task consists in reconstructing the original images from a 
low-dimensional internal representation, using the structure of an autoencoder. As 
the computation of the reconstruction error does not require class labels, this is 
applicable also to unlabeled data sets, which are in general larger and easier to obtain 
than labeled data sets. The network needs to accurately reconstruct the given stimuli 
while using only information from a small bottleneck layer. The small size of the 
bottleneck layer creates an incentive to encode similar input stimuli in similar ways 
such that the corresponding reconstructions are also similar to each other. Again, 
this similarity relation learned from the overall data set might be useful for learning 
the mapping into the similarity space. The autoencoder structure has the additional 
advantage that one can use the decoder network to generate an image based on a 
point in the conceptual space. This can be a useful tool for visualization and further 
analysis. 

One should be aware that there is a difference between perceptual and conceptual 
similarity: Perceptual similarity focuses on the similarity of the raw stimuli, e.g., 
with respect to their shape, size, and color. Conceptual similarity on the other hand 
takes place on a more abstract level and involves conceptual information such as 
the typical usage of an object or typical locations where a given object might be 
found. For instance, a violin and a piano are perceptually not very similar as they 
have different sizes and shapes. Conceptually, they might be however quite similar 
as they are both musical instruments that can be found in an orchestra. 

While class labels can be assigned on both the perceptual (ROUND vs. ELONGATED)
and the conceptual level (MUSICAL INSTRUMENT vs. FRUIT), the reconstruction
objective always operates on the perceptual level. If the similarity data collected
in the psychological experiment is of perceptual nature, then both secondary tasks 
seem promising. If we however target conceptual similarity, then the classification 
objective seems to be the preferable choice. 


4.2 Related Work


Peterson et al. (2017, 2018) have investigated whether the activation vectors of a 
neural network can be used to predict human similarity ratings. They argue that this 
can enable researchers to validate psychological theories on large data sets of real 
world images. 

In their study, they used six data sets, each containing 120 images (each 300 by 300
pixels) of one visual domain (namely, animals, automobiles, fruits, furniture, vegetables,
and “various”). Peterson et al. conducted a psychological study which elicited
pairwise similarity ratings for all pairs of images using a Likert scale. When apply- 
ing multidimensional scaling to the resulting dissimilarity matrix, they were able to 
identify clear clusters in the resulting space (e.g., all birds being located in a simi- 
lar region of the animal space). Moreover, when applying a hierarchical clustering 
algorithm on the collected similarity data, a meaningful dendrogram emerged. 

In order to extract similarity ratings from five different neural networks, they
computed for each image the activation in the second-to-last layer of the network.
Then for each pair of images, they defined their similarity as the inner product
($u^T v = \sum_{i=1}^{n} u_i v_i$) of these activation vectors. When applying MDS to the resulting
dissimilarity matrix, no meaningful clusters were observed. Also a hierarchical clustering
did not result in a meaningful dendrogram. When considering the correlation
between the dissimilarity ratings obtained from the neural networks and the human
dissimilarity matrix, they were able to achieve values of $R^2$ between 0.19 and 0.58
(depending on the visual domain).

Peterson et al. found that their results considerably improved when using a
weighted version of the inner product ($\sum_{i=1}^{n} w_i u_i v_i$): Both the similarity space
obtained by MDS and the dendrogram obtained by hierarchical clustering became
more human-like. Moreover, the correlation between the predicted similarities and
the human similarity ratings increased to values of $R^2$ between 0.35 and 0.74.

While the approach by Peterson et al. illustrates that there is a connection between 
the features learned by neural networks and human similarity ratings, it differs from 
our proposed approach in one important aspect: Their primary goal is to find a way 
to predict the similarity ratings directly. Our research on the other hand is focused 
on predicting points in the underlying similarity space. 

Sanders and Nosofsky (2018) have used a data set containing 360 pictures of rocks 
along with an eight-dimensional similarity space for a study which is quite similar 
in spirit to what we will present in Sect. 5. Their goal was to train an ensemble of
convolutional neural networks for predicting the correct coordinates in the similarity 
space for each rock image from the data set. As the data set is considerably too small 
for training an ANN from scratch, they used a pre-trained network as a starting point. 
They removed the topmost layers and replaced them by untrained, fully connected 
layers with an output of eight linear units, one per dimension of the similarity space. 
In order to increase the size of their data set, they applied data augmentation methods 
by flipping, rotating, cropping, stretching, and shrinking the original images. 

Their results on the test set showed a value of $R^2$ of 0.808, which means that
over 80% of the variance was accounted for by the neural network. Moreover, an 
exemplar model on the space learned by the convolutional neural network was able 
to explain 98.9% of the variance seen in human categorization performance. 

The work by Sanders and Nosofsky is quite similar in spirit to our own approach:
Like us, they train a neural network to learn the mapping between images and a
similarity space extracted from human similarity ratings. They do so by resorting to
a pre-trained neural network and by using data augmentation techniques. While they
use a data set of 360 images, we are limited to an even smaller data set containing
only 64 images. This makes the machine learning problem even more challenging.
Moreover, the data set used by Sanders and Nosofsky is based on real objects, whereas
our study investigates a data set of novel and unknown objects. Finally, while they
confine themselves to a single target similarity space for their regression task, we
investigate the influence of the target space on the overall results.


5 Machine Learning Experiments 


In order to validate whether our proposed approach is worth pursuing, we conducted 
a feasibility study based on the similarity spaces obtained for the NOUN data set 
in Sect. 3. Instead of training a neural network from scratch, we limit ourselves to
a simple regression on top of a pre-trained image classification network. With the 
three experiments in our study, we address the following three research questions, 
respectively: 


1. Can we learn a useful mapping from colored images into a low-dimensional 
psychological similarity space from a small data set of novel objects for which 
no background knowledge is available? 

Our prediction: The learned mapping is able to clearly beat a simple baseline. 
However, it does not reach the level of generalization observed in the study of 
Sanders and Nosofsky (2018) due to the smaller amount of data available. 

2. How does the MDS algorithm being used to construct the target similarity space
influence the results?

Our prediction: There are no considerable differences between metric and
nonmetric MDS.

3. How does the size of the target similarity space (i.e., the number of dimensions)
influence the machine learning results?

Our prediction: Very small target spaces are not able to reflect the similarity
ratings very well and do not contain much meaningful structure. Very large target
spaces on the other hand increase the number of parameters in the model
which makes overfitting more likely. By this reasoning, medium-sized target spaces
should provide a good trade-off and therefore the best regression performance.


5.1 Methods


Please recall from Sect. 3 that the NOUN data set contains only 64 images with
an image size of 300 by 300 pixels. As this number of training examples is too low
for applying machine learning techniques, we augmented the data set by applying
random crops, a Gaussian blur, additive Gaussian noise, affine transformations (i.e., 
rotations, shears, translations, and scaling), and by manipulating the image’s contrast 
and brightness. These augmentation steps were executed in random order and with 
randomized parameter settings. For each of the original 64 images, we created 1,000 
augmented versions, resulting in a data set of 64,000 images in total. We assigned 
the target coordinates of the original image to each of the 1,000 augmented versions. 
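
The overall structure of this augmentation procedure can be sketched as follows. The sketch is not our actual pipeline: it operates on small grayscale images stored as nested lists, implements only three simple stand-in operations (noise, brightness, contrast), and all function names and parameter ranges are hypothetical.

```python
import random

def clip(value):
    """Keep pixel values inside the valid range [0, 255]."""
    return max(0.0, min(255.0, value))

def add_noise(image, rng):
    sigma = rng.uniform(0.0, 10.0)
    return [[clip(p + rng.gauss(0.0, sigma)) for p in row] for row in image]

def change_brightness(image, rng):
    offset = rng.uniform(-30.0, 30.0)
    return [[clip(p + offset) for p in row] for row in image]

def change_contrast(image, rng):
    factor = rng.uniform(0.75, 1.25)
    mean = sum(map(sum, image)) / (len(image) * len(image[0]))
    return [[clip(mean + factor * (p - mean)) for p in row] for row in image]

def augment(image, rng):
    """Apply all steps in random order with randomized parameters."""
    steps = [add_noise, change_brightness, change_contrast]
    rng.shuffle(steps)
    for step in steps:
        image = step(image, rng)
    return image

def build_dataset(images, targets, copies, seed=0):
    """Each augmented copy inherits the target coordinates of its original."""
    rng = random.Random(seed)
    return [(augment(image, rng), target)
            for image, target in zip(images, targets)
            for _ in range(copies)]
```

Calling `build_dataset` with 64 images and `copies=1000` would yield 64,000 labeled examples, mirroring the size of the augmented data set described above.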

For our regression experiments, we used two different types of feature spaces: 
The pixels of downscaled images and high-level activation vectors of a pre-trained 
neural network. 

For the ANN-based features, we used the Inception-v3 network (Szegedy et al. 
2016). For each of the augmented images, we used the activations of the second-to- 
last layer as a 2048-dimensional feature vector. Instead of training both the mapping 
and the classification task simultaneously (as discussed in Sect. 4), we use an already 
pre-trained network and augment it by an additional output layer. 

As a comparison to the ANN-based features, we used an approach similar to
the pixel baseline from Sect. 3.2: We downscaled each of the augmented images by
dividing it into equal-sized blocks and by computing the minimum (which showed
the best correlation to the dissimilarity ratings in Sect. 3.3) across all values in each
of these blocks as one entry of the feature vector. We used block sizes of 12 and
24, resulting in feature vectors of size 1875 and 507, respectively (based on three
color channels for downscaled images of size 25 × 25 and 13 × 13, respectively). By
using these two pixel-based feature spaces, we can analyze differences between low- 
dimensional and high-dimensional feature spaces. As the high-dimensional feature 
space is in the same order of magnitude as the ANN-based feature space, we can 
also make a meaningful comparison between pixel-based features and ANN-based 
features. 

We compare our regression results to the zero baseline which always predicts
the origin of the coordinate system. In preliminary experiments, it has been shown to
be superior to other simple baselines (such as drawing from a normal
distribution estimated from the training targets). We do not expect this baseline to
perform well in our experiments, but it defines a lower performance bound for the
regressors.

In our experiments, we limit ourselves to two simple off-the-shelf regressors,
namely a linear regression and a lasso regression. Let $N$ be the number of data
points, $t$ the number of target dimensions, $y_d^{(i)}$ the target value of data point $i$ in
dimension $d$, and $\hat{y}_d^{(i)}$ the prediction of the regressor for data point $i$ in dimension $d$.

Both of our regressors make use of a simple linear model for each of the dimensions
in the target space:

$$\hat{y}_d = f_d(\vec{x}) = \sum_{k=1}^{K} w_k^{(d)} x_k$$

Here, $K$ is the number of features and $\vec{x}$ is the feature vector. In a linear least-squares
regression, the weights $w_k^{(d)}$ of this model are estimated by minimizing the mean
squared error between the model’s predictions and the actual ground truth values:

$$\mathrm{MSE} = \frac{1}{N} \sum_{i=1}^{N} \sum_{d=1}^{t} \left( y_d^{(i)} - \hat{y}_d^{(i)} \right)^2$$

As the number of features is quite high, even a linear regression needs to estimate
a large number of weights. In order to prevent overfitting, we also consider a lasso
regression, which additionally incorporates the L1 norm of the weight matrix as a
regularization term, weighted by a constant β. The first part of its objective
corresponds to the mean squared error of the linear model’s predictions, while the
second part corresponds to the overall size of the weights. If the constant β is tuned
correctly, this can prevent overfitting and thus improve performance on the test set.
In our experiments, we investigated the following values:

β ∈ {0.0, 0.001, 0.002, 0.005, 0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1.0, 2.0, 5.0, 10.0}

Please note that β = 0 corresponds to an ordinary linear least-squares regression.
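
The linear model and the regularized objective described above can be made concrete with a short sketch in plain Python. The function names are hypothetical, and the scaling of the squared-error term is an assumption: here it is the plain mean squared error, whereas libraries such as scikit-learn scale it by 1/(2N), which changes the effective size of β but not the structure of the objective.

```python
def predict(weights, x):
    """Linear model: one weighted sum of the features per target dimension.

    `weights[d][k]` is the weight of feature k for target dimension d.
    """
    return [sum(w_k * x_k for w_k, x_k in zip(w_d, x)) for w_d in weights]


def mean_squared_error(weights, features, targets):
    """Squared error summed over target dimensions, averaged over data points."""
    total = 0.0
    for x, y in zip(features, targets):
        y_hat = predict(weights, x)
        total += sum((y_d - p_d) ** 2 for y_d, p_d in zip(y, y_hat))
    return total / len(features)


def lasso_objective(weights, features, targets, beta):
    """Squared-error term plus beta times the L1 norm of the weight matrix."""
    l1_norm = sum(abs(w) for w_d in weights for w in w_d)
    return mean_squared_error(weights, features, targets) + beta * l1_norm


# Toy data: two data points, two features, one target dimension.
features = [[1.0, 2.0], [0.0, 1.0]]
targets = [[1.0], [0.5]]
weights = [[0.5, 0.25]]
unregularized = lasso_objective(weights, features, targets, beta=0.0)
regularized = lasso_objective(weights, features, targets, beta=0.1)
```

With `beta=0.0`, the objective reduces to the mean squared error, which mirrors the remark that β = 0 corresponds to ordinary least squares.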

With our experiments, we would also like to investigate whether learning a map- 
ping into a psychological similarity space is easier than learning a mapping into an 
arbitrary space of the same dimensionality. In addition to the real regression targets 
(which are the coordinates from the similarity space obtained by MDS), we created 
another set of regression targets by randomly shuffling the assignment from images 
to target points. We ensured that all augmented images created from the same original 
image were still mapped onto the same target point. With this shuffling procedure, 
we aimed to destroy any semantic structure inherent in the target space. We expect 
that the regression works better for the original targets than for the shuffled targets. 
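
The shuffling procedure can be sketched as follows. This is an illustrative reconstruction in plain Python, not the original implementation. The function names, the dictionary-based data layout, and the fixed random seed are assumptions.

```python
import random


def shuffle_targets(targets_by_image, seed=0):
    """Randomly reassign target points to original images.

    `targets_by_image` maps an original image id to its target coordinates.
    The returned mapping uses exactly the same set of target points.
    """
    image_ids = sorted(targets_by_image)
    shuffled_points = [targets_by_image[i] for i in image_ids]
    random.Random(seed).shuffle(shuffled_points)
    return dict(zip(image_ids, shuffled_points))


def expand_to_augmented(targets_by_image, augmented_per_image):
    """Give every augmented version the target of its original image."""
    return [
        (image_id, targets_by_image[image_id])
        for image_id in sorted(targets_by_image)
        for _ in range(augmented_per_image)
    ]


# Toy example: 4 original images in a 2-D target space, 3 augmentations each.
original = {0: (0.1, 0.2), 1: (-0.3, 0.4), 2: (0.5, -0.6), 3: (-0.7, -0.8)}
shuffled = shuffle_targets(original, seed=42)
dataset = expand_to_augmented(shuffled, augmented_per_image=3)
```

Because the permutation is applied at the level of original images, all augmented versions of one image keep a common target, while the link between image content and position in the space is broken.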

In order to evaluate both the regressors and the baseline, we used three different 
evaluation metrics: 


• The mean squared error (MSE) sums over the average squared difference
between the prediction and the ground truth for each output dimension.
• The mean Euclidean distance (MED) provides us with a way of quantifying the
average distance between the prediction and the target in the similarity space.
• The coefficient of determination R² can be interpreted as the amount of variance
in the targets that is explained by the regressor’s predictions.
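
As an illustration, the three metrics can be written down in a few lines of plain Python. The function names are hypothetical, and the treatment of R² for several target dimensions (a uniform average of the per-dimension values) is an assumption, since the text does not specify it.

```python
import math


def mse(targets, predictions):
    """Squared error summed over dimensions, averaged over data points."""
    return sum(
        sum((y - p) ** 2 for y, p in zip(ys, ps))
        for ys, ps in zip(targets, predictions)
    ) / len(targets)


def med(targets, predictions):
    """Average Euclidean distance between prediction and target."""
    return sum(
        math.sqrt(sum((y - p) ** 2 for y, p in zip(ys, ps)))
        for ys, ps in zip(targets, predictions)
    ) / len(targets)


def r_squared(targets, predictions):
    """Coefficient of determination, averaged over dimensions (assumption)."""
    n, dims = len(targets), len(targets[0])
    scores = []
    for d in range(dims):
        mean_d = sum(ys[d] for ys in targets) / n
        ss_res = sum((ys[d] - ps[d]) ** 2 for ys, ps in zip(targets, predictions))
        ss_tot = sum((ys[d] - mean_d) ** 2 for ys in targets)
        scores.append(1.0 - ss_res / ss_tot)
    return sum(scores) / dims


# Zero baseline on a toy set of centered 2-D targets with unit norm.
targets = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]
zero_predictions = [(0.0, 0.0)] * len(targets)
```

On this toy set, the zero baseline predicts the mean of the centered targets, so it obtains an R² of zero and an MSE and MED of one.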

We evaluated all regressors using an eight-fold cross validation approach, where
each fold contains all the augmented images generated from eight of the original
images. In each iteration, one of these folds was used as test set, whereas all other folds
were used as training set. We aggregated all predictions over these eight iterations
(ending up with exactly one prediction per data point) and computed the evaluation
metrics on this set of aggregated predictions.
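
The cross validation scheme can be sketched as below. This is a simplified illustration in plain Python. The function names, the tuple layout of the samples, and the way original images are assigned to folds (by striding over the sorted ids) are assumptions, as the text only states that each fold covers eight original images.

```python
def make_folds(image_ids, num_folds=8):
    """Assign every original image id to exactly one fold."""
    ids = sorted(set(image_ids))
    return [set(ids[f::num_folds]) for f in range(num_folds)]


def cross_validate(samples, fit, num_folds=8):
    """Return one held-out prediction per sample.

    `samples` is a list of (image_id, features, target) tuples, and
    `fit(train_samples)` must return a function mapping features to a prediction.
    """
    folds = make_folds([image_id for image_id, _, _ in samples], num_folds)
    predictions = [None] * len(samples)
    for test_ids in folds:
        # All augmented versions of a test image are excluded from training.
        train = [s for s in samples if s[0] not in test_ids]
        model = fit(train)
        for index, (image_id, features, _) in enumerate(samples):
            if image_id in test_ids:
                predictions[index] = model(features)
    return predictions


def fit_zero_baseline(train_samples):
    """Zero baseline: ignore the training data and predict the origin."""
    return lambda features: (0.0, 0.0)


# Toy data: 64 original images with 2 augmented versions each.
samples = [(i, [float(i)], (0.0, 0.0)) for i in range(64) for _ in range(2)]
predictions = cross_validate(samples, fit_zero_baseline)
```

Splitting by original image rather than by augmented image ensures that no variant of a test image is seen during training.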


5.2 Experiment 1: Comparing Feature Spaces and Regressors 


In our first experiment, we want to test the following hypotheses: 


1. The learned mapping is able to clearly beat the baseline. However, it does not
reach the level of generalization observed in the study of Sanders and Nosofsky
(2018) due to the smaller amount of data available.

2. A regression from the ANN-based features is more successful than a regression
from the pixel-based features.

3. As the similarity spaces created by MDS encode semantic similarity by geometric
distance, we expect that learning the correct mapping generalizes better to the test
set than learning a shuffled mapping.

4. As the feature vectors are quite large, the linear regression has a large number
of weights to optimize, inviting overfitting. Regularization through the L1 loss
included in the lasso regressor can help to reduce overfitting.

Table 1 Performance of the different regressors for different feature spaces and correct versus
shuffled targets on the four-dimensional space by Horst and Hout (2016). The best results for each
combination of column and regressor are highlighted in boldface

Regressor | Feature space | Targets  | Test MSE | Test MED | Test R² | Overfitting MSE | Overfitting MED | Overfitting R² | β
Baseline  | Any           | Any      |          |          |         | 1.0000          |                 | 1.0000         | -
Linear    | ANN (2048)    | Correct  |          |          |         | 42.7093         |                 | 2.6171         | -
Linear    | ANN (2048)    | Shuffled |          |          |         | 56.5298         |                 | -7.0475        | -
Linear    | Pixel (1875)  | Correct  |          |          |         | 2.6191          |                 | -1.5199        | -
Linear    | Pixel (1875)  | Shuffled |          |          |         | 2.6629          |                 | -0.6700        | -
Linear    | Pixel (507)   | Correct  |          |          |         | 2.3360          |                 | -2.2664        | -
Linear    | Pixel (507)   | Shuffled | 1.5853   | 1.2072   | -0.5727 | 2.5049          | 1.6036          | -0.6458        | -
Lasso     | ANN (2048)    | Correct  | 0.5740   | 0.7264   | 0.4111  | 28.3766         | 5.6239          | 2.3727         | 0.005, 0.01
Lasso     | Pixel (1875)  | Correct  | 0.9183   | 0.9391   | 0.0788  | 1.1320          | 1.1371          | 2.3313         | 0.2, 0.5
Lasso     | Pixel (507)   | Correct  | 0.8946   | 0.9292   | 0.1015  | 1.1677          | 1.1251          | 2.2538         | 0.05, 0.1


5. For smaller feature vectors, we expect less overfitting tendencies than for larger 
feature vectors. Therefore, less regularization should be needed to achieve optimal 
performance. 


Here, we limit ourselves to a single target space, namely the four-dimensional 
similarity space obtained by Horst and Hout (2016) through metric MDS. 

Table 1 shows the results obtained in our experiment, grouped by the regression 
algorithm, feature space, and target mapping used. We have also reported the observed 
degree of overfitting. It is calculated by dividing training set performance by test set 
performance. Perfect generalization would result in a degree of overfitting of one, 
whereas larger values reflect the factor to which the regression is more successful on 
the training set than on the test set. Let us for now only consider the linear regression. 
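
The degree of overfitting can be illustrated with a small helper function. This sketch rests on an interpretation rather than on a definition given in the text: since larger values are meant to indicate a larger advantage on the training set, we assume that for error measures (MSE, MED) the test value is divided by the training value, whereas for R² the training value is divided by the test value. The function name and the example numbers are hypothetical.

```python
def degree_of_overfitting(train_score, test_score, higher_is_better):
    """Factor by which the regressor does better on the training set.

    For scores where higher is better (such as R^2), this is train / test.
    For errors (such as MSE and MED), it is test / train (assumed reading).
    A value of 1 corresponds to perfect generalization.
    """
    if higher_is_better:
        return train_score / test_score
    return test_score / train_score


# Hypothetical numbers: a training MSE of 0.02 against a test MSE of 0.6,
# and a training R^2 of 0.9 against a test R^2 of 0.45.
mse_overfitting = degree_of_overfitting(0.02, 0.6, higher_is_better=False)
r2_overfitting = degree_of_overfitting(0.9, 0.45, higher_is_better=True)
```

Under this reading, a negative value for R² arises whenever the test set R² is negative while the training set R² is positive.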

We first focus on the results obtained on the ANN-based feature set. As we can see, 
the linear regression is able to beat the baseline when trained on the correct targets. 
The overall approach therefore seems to be sound. However, we see strong overfitting 
tendencies, showing that there is still room for improvement. When trained on the 
shuffled targets, the linear regression completely fails to generalize to the test set. 
This shows that the correct mapping (having a semantic meaning) is easier to learn 
than an unstructured mapping. In other words, the semantic structure of the similarity 
space makes generalization possible. 

Let us now consider the pixel-based feature spaces. For both of these spaces, 
we observe that linear regression performs worse than the baseline. Moreover, we 
can see that learning the shuffled mapping results in even poorer performance than
learning the correct mapping. Due to the overall poor performance, we do not observe
very strong overfitting tendencies. Finally, when comparing the two pixel-based 
feature spaces, we observe that the linear regression tends to perform better on the 
low-dimensional feature space than on the high-dimensional one. However, these 
performance differences are relatively small. 

Overall, ANN-based features seem to be much more useful for our mapping task 
than the simple pixel-based features, confirming our observations from Sect. 3. 

In order to further improve our results, we now varied the regularization factor β
of the lasso regressor for all feature spaces.

For the ANN-based feature space, we are able to achieve a slight but consistent
improvement by introducing a regularization term: Increasing β causes poorer
performance on the training set while yielding improvements on the test set. The best
results on the test set are achieved for β ∈ {0.005, 0.01}. If β however becomes too
large, then performance on the test set starts to decrease again: for β = 0.05 we
do not see any improvements over the vanilla linear regression any more. For β > 5,
the lasso regression collapses and performs worse than the baseline.

Although we are able to improve our performance slightly, the gap between train- 
ing set performance and test set performance still remains quite high. It seems that 
the overfitting problem can be somewhat mitigated but not solved on our data set 
with the introduction of a simple regularization term. 

When comparing our best results to the ones obtained by Sanders and Nosofsky
(2018), who achieved values of R² ≈ 0.8, we have to recognize that our approach
performs considerably worse with R² ≈ 0.4. However, the much smaller number of data
points in our experiment makes our learning problem much harder than theirs. Even
though we use data augmentation, the small number of different targets might put a 
hard limit on the quality of the results obtainable in this setting. Moreover, Sanders 
and Nosofsky retrained the whole neural network in their experiments, whereas we 
limit ourselves to the features extracted by the pre-trained network. As we are nev- 
ertheless able to clearly beat our baselines, we take these results as supporting the 
general approach. 

For the pixel-based feature spaces, we can also observe positive effects of
regularization. For the large space, the best results on the test set are achieved for larger
values of β ∈ {0.2, 0.5}. These results are however only slightly better than baseline
performance. For the small pixel-based feature space, the optimal value of β lies in
{0.05, 0.1}, leading again to a test set performance slightly superior to the baseline.
In case of the small pixel-based feature space, already values of β > 1 lead to a
collapse of the model.

Comparing the regularization results on the three feature spaces, we can conclude 
that regularization is indeed helpful, but only to a small degree. On the ANN-based 
feature space, we still observe a large amount of overfitting, and performance on 
the pixel-based feature spaces is still relatively close to the baseline. Looking at the 
optimal values of β, it seems like the lower-dimensional pixel-based feature space
needs less regularization than its higher-dimensional counterpart. Presumably, this
is caused by the smaller possibility for overfitting in the lower-dimensional feature
space. Even though the larger pixel-based feature space and the ANN-based feature
space have a similar dimensionality, the pixel-based feature space requires a larger
degree of regularization for obtaining optimal performance, indicating that it is more
prone to overfitting than the ANN-based feature space.


5.3 Experiment 2: Comparing MDS Algorithms 


After having analyzed the soundness of our approach in experiment 1, we compare 
target spaces of the same dimensionality, but obtained with different MDS algorithms. 
More specifically, we compare the results from experiment 1 to analogous procedures 
applied to the ANN-based feature space and the four-dimensional similarity spaces 
created by both metric and nonmetric SMACOF in Sect. 3. Table 2 shows the results 
of our second experiment. 

In a first step, we can compare the different target spaces by taking a look at the 
behavior of the zero baseline in each of them. As we can see, the values for MSE 
and R² are identical for all of the different spaces. Only for the MED can we observe
some slight variations, which can be explained by the slightly different arrangements 
of points in the different similarity spaces. 

As we can see from Table 2, the results for the linear regression on the different
target spaces are comparable. This adds further support to our results from Sect. 3:
Also when considering the usage as target space for machine learning, metric MDS
does not seem to have any advantage over nonmetric MDS.

Table 2 Comparison of the results obtainable on four-dimensional spaces created by different
MDS algorithms. Best results in each column are highlighted for each of the regressors

Regressor | Target space     | Test MSE | Test MED | Test R² | Overfitting MSE | Overfitting MED | Overfitting R² | β
Baseline  | Horst and Hout   |          |          |         | 1.0000          |                 |                |
Baseline  | Metric SMACOF    |          |          |         | 1.0000          |                 |                |
Baseline  | Nonmetric SMACOF |          |          |         | 1.0000          |                 |                |
Linear    | Horst and Hout   |          |          |         | 42.7093         |                 |                |
Linear    | Metric SMACOF    |          |          |         | 42.2885         |                 |                |
Linear    | Nonmetric SMACOF | 0.6086   | 0.7461   |         | 42.4380         | 6.8305          |                |
Lasso     | Horst and Hout   | 0.5740   | 0.7264   |         | 28.3766         | 5.6239          |                |
Lasso     | Metric SMACOF    | 0.6052   | 0.7458   | 0.3880  | 35.0367         | 6.2463          | 2.5326         | 0.002
Lasso     | Nonmetric SMACOF | 0.5938   | 0.7316   | 0.3853  | 29.1236         | 5.6497          | 2.5413         | 0.005

For the lasso regressor, we observed similar effects for all of the target spaces: 
A certain amount of regularization is helpful to improve test set performance, while 
too much emphasis on the regularization term causes both training and test set per- 
formance to collapse. We still observe a large amount of overfitting even after using 
regularization. Again, the results are comparable across the different target spaces. 
However, the optimal performance on the space obtained with metric SMACOF is 
consistently worse than the results obtained on the other two spaces. As the space by 
Horst and Hout is however also based on metric MDS, we cannot use this observation 
as an argument for nonmetric MDS. 


5.4 Experiment 3: Comparing Target Spaces of Different Size 


In our third and final experiment in this study, we vary the number of dimensions 
in the target space. More specifically, we consider similarity spaces with one to ten 
dimensions that have been created by nonmetric SMACOF. Again, we only consider 
the ANN-based feature space. 

Table 3 displays the results obtained in our third experiment and Fig. 6 provides a 
graphical illustration. When looking at the zero baseline, we observe that the mean 
Euclidean distance tends to grow with an increasing number of dimensions, with an 
asymptote of one. This indicates that in higher-dimensional spaces, the points seem 
to lie closer to the surface of a unit hypersphere around the origin. For both MSE 
and R², we do not observe any differences between the target spaces.
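
The upper bound of one on the baseline's MED can be made explicit with a short derivation. It is a sketch under two assumptions: that the MSE of the zero baseline equals one in every target space (as reported in Table 3), and that MSE and MED average the squared and the plain Euclidean distance, respectively, over the $N$ data points. Here, $\vec{y}^{(i)}$ denotes the target vector of data point $i$.

```latex
\mathrm{MED}_{\mathrm{zero}}
  = \frac{1}{N} \sum_{i=1}^{N} \lVert \vec{y}^{(i)} \rVert
  \le \sqrt{\frac{1}{N} \sum_{i=1}^{N} \lVert \vec{y}^{(i)} \rVert^{2}}
  = \sqrt{\mathrm{MSE}_{\mathrm{zero}}}
  = 1
```

The inequality follows from Jensen's inequality, and equality holds exactly when all target points have the same distance to the origin. A MED close to one is therefore consistent with points lying near the surface of the unit hypersphere.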

Let us now look at the results of the linear regression. It seems that for all the 
evaluation metrics, a two-dimensional target space yields the best result. With an 
increasing number of dimensions in the target space, performance tends to decrease. 
We can also observe that the amount of overfitting is optimal for a two-dimensional 
space and tends to increase with an increasing number of dimensions. A notable 
exception is the one-dimensional space which suffers strongly from overfitting and 
whose performance with respect to all three evaluation metrics is clearly worse than 
the baseline. 

The optimal performance of a lasso regressor on the different target spaces yields 
similar results: For all target spaces, a certain amount of regularization can help to 
improve performance but too much regularization decreases performance. Again, we 
can only counteract a relatively small amount of the observed overfitting. As we can 
see in Table 3, again a two-dimensional space yields the best results. With respect 
to the optimal regularization factor β, we can observe that low-dimensional spaces
with up to three dimensions seem to use larger values of β than higher-dimensional
spaces with four dimensions and more. This difference in the degree of regularization 
is also reflected in the different degrees of overfitting observed for these groups of 
spaces. 

Table 3 Performance of the zero baseline, the linear regression, and the lasso regression on target
spaces of different dimensionality t derived with nonmetric SMACOF, along with the relative
amount of overfitting. Best values for each column are highlighted for each of the regressors

Regressor | t  | Test MSE | Test MED | Test R² | Overfitting MSE | Overfitting MED | Overfitting R² | β
Baseline  | 1  | 1.0000   | 0.8664   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 2  | 1.0000   | 0.9580   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 3  | 1.0000   | 0.9848   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 4  | 1.0000   | 0.9956   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 5  | 1.0000   | 0.9966   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 6  | 1.0000   | 0.9973   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 7  | 1.0000   | 0.9978   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 8  | 1.0000   | 0.9980   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 9  | 1.0000   | 0.9982   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Baseline  | 10 | 1.0000   | 0.9984   | 0.0000  | 1.0000          | 1.0000          | 1.0000         | -
Linear    | 1  | 1.1499   | 0.9046   | -0.1499 | 59.0040         | 8.3419          | -6.5413        | -
Linear    | 2  | 0.4995   | 0.6370   | 0.5002  | 38.9046         | 6.5291          | 1.9734         | -
Linear    | 3  | 0.5554   | 0.6979   | 0.4435  | 41.4309         | 6.7360          | 2.2243         | -
Linear    | 4  | 0.6086   | 0.7461   | 0.3706  | 42.4380         | 6.8305          | 2.6585         | -
Linear    | 5  | 0.6333   | 0.7692   | 0.3595  | 43.4577         | 6.9023          | 2.7405         | -
Linear    | 6  | 0.6359   | 0.7734   | 0.3469  | 43.4900         | 6.8770          | 2.8397         | -
Linear    | 7  | 0.6675   | 0.7956   | 0.3204  | 44.7364         | 6.9621          | 3.0741         | -
Linear    | 8  | 0.6846   | 0.8094   | 0.3033  | 45.1247         | 6.9876          | 3.2459         | -
Linear    | 9  | 0.6810   | 0.8078   | 0.2983  | 44.8367         | 6.9591          | 3.3004         | -
Linear    | 10 | 0.7107   | 0.8259   | 0.2807  | 46.0530         | 7.0432          | 3.5076         | -
Lasso     | 1  | 0.9912   | 0.8368   | 0.0088  | 1.3656          | 1.7043          | 30.9878        | 1, 2
Lasso     | 2  | 0.4728   | 0.6052   | 0.5271  | 19.1298         | 4.5081          | 1.8504         | 0.02
Lasso     | 3  | 0.5322   | 0.6720   | 0.4722  | 19.4148         | 4.5725          | 2.0593         | 0.02
Lasso     | 4  | 0.5938   | 0.7316   | 0.3853  | 29.1237         | 5.6497          | 2.5413         | 0.005
Lasso     | 5  | 0.6180   | 0.7576   | 0.3755  | 35.1383         | 6.2167          | 2.6160         | 0.002
Lasso     | 6  | 0.6274   | 0.7651   | 0.3548  | 35.0732         | 6.1797          | 2.7724         | 0.001, 0.002
Lasso     | 7  | 0.6589   | 0.7839   | 0.3280  | 39.6352         | 5.0619          | 2.9979         | 0.001, 0.01
Lasso     | 8  | 0.6752   | 0.8022   | 0.3117  | 39.5496         | 6.5669          | 3.1527         | 0.001
Lasso     | 9  | 0.6680   | 0.7980   | 0.3108  | 38.8777         | 6.1359          | 3.1615         | 0.001, 0.002
Lasso     | 10 | 0.6993   | 0.8166   | 0.2924  | 35.3561         | 5.5563          | 3.3513         | 0.002, 0.005

Fig. 6 Visualization of the regression results for MSE, MED, and R² as a function of the number
of dimensions


Taken together, the results of our third experiment show that a higher-dimensional
target space makes the regression problem more difficult, but that a one-dimensional
target space does not contain enough semantic structure for a successful mapping.
It seems that a two-dimensional space is in our case the optimal trade-off. However,
even the performance of the lasso regressor on this space is far from satisfactory,
which calls for further research.


6 Conclusions 


The contributions of this paper are twofold.

In our first study, we investigated whether the dissimilarity ratings obtained
through SpAM are ratio scaled by applying both metric MDS (which assumes a
ratio scale) and nonmetric MDS (which only assumes an ordinal scale). Both MDS
variants produced comparable results; it thus seems that assuming a ratio scale is
neither beneficial nor harmful. We therefore recommend using nonmetric MDS, as
its underlying assumptions are weaker. Future studies on other data sets obtained
through SpAM should seek to confirm or contradict our results.

In our second study, we analyzed whether learning a mapping from raw images
to points in a psychological similarity space is possible. Our results showed that 
using the activations of a pre-trained ANN as features for a regression task seems to 
work in principle. However, we observed very strong overfitting tendencies in our 
experiments. Furthermore, the overall performance level we were able to achieve is 
still far from satisfactory. The results by Sanders and Nosofsky (2018) however show 
that larger amounts of training data can alleviate these problems. Future work in this 
area should focus on improvements in performance and robustness of this approach. 

As follow-up work, we are currently conducting a study on a data set of shapes, 
where we plan to apply more sophisticated machine learning methods in order to 
counteract the observed overfitting tendencies. 


1 Why Traditional Knowledge Representation Is Insufficient


Future information systems, such as virtual assistants, augmented reality systems, and 
semi-autonomous or autonomous machines (Chan et al. 2009; Hermann et al. 2016), 
require access to large amounts of world knowledge in combination with sensor data. 
Consider a smart home scenario involving interconnected light bulbs. Here, a desired 
rule could be: “switch on the light in the hallway when somebody enters the home 
and set the light level in the hallway to below 50 lux.” In this scenario, there needs 
to be a common understanding (i.e., semantics) of all the information (concepts 
and facts) mentioned in this command between the user and the device, such as 
“light,” “hallway,” “50 lux,” but also of situational aspects, such as “when somebody 
enters the home.” In a Health 2.0 scenario, connected devices measure parameters 
concerning a patient’s health. The data need to be transformed (ideally automatically) 
into symbolically grounded knowledge and combined with the existing knowledge 
about health, diseases, and treatments (Henson et al. 2012). 

These examples demonstrate that knowledge representation for Internet of Things
scenarios is needed. On closer inspection, they indicate that three aspects
are particularly essential for knowledge representation in the Internet of Things:


1. How are perceptions and actions grounded and represented (in the Internet of 
Things terminology, sensors and actuators)? 

2. How can machines and humans agree on a common understanding when referring 
to concepts and facts, and how can this common understanding be shared? 

3. How can changes in the world be used in knowledge representation? 


In the past, research on knowledge representation in computer science has mainly
focused on developing and using static ontologies (i.e., formal, explicit specifications
of a shared conceptualization in a domain of interest (Studer et al. 1998)) and
knowledge graphs (Fensel et al. 2020; Färber et al. 2018). Ontological languages,
such as the Resource Description Framework (RDF) (Cyganiak et al. 2014), RDF
Schema (Brickley and Guha 2014), and the Web Ontology Language (OWL)
(Bechhofer et al. 2004), have been established to model parts of the world. To connect
world knowledge with sensor data, a few ad-hoc solutions have been proposed (e.g.,
Bonnet et al. 2000; Ganz et al. 2016, and Sect. 3). However, in our view, all these
technologies are not capable of sufficiently incorporating the aspects of the Internet of
Things as outlined above.

In this chapter, we want to take up the previous considerations on knowledge 
representation in the context of the Internet of Things; we thereby make use of 
content from epistemology—particularly, the semantic theories—for our discussion 
on an optimal knowledge representation, addressing research question 1 “How can
we formally describe and model concepts?” outlined in Chap. 1 of this book. We can
show that the problem of knowledge representation for the Internet of Things is by no
means trivial and that questions about concrete implementations lead to fundamental
questions of knowledge representation, such as the symbol grounding problem (Harnad
1990) and the intersubjectivity problem (Reich 2010).

The topic of this chapter is highly interdisciplinary. Consequently, it is written for 
a diversity of user groups: 


• Computer scientists, cognitive scientists, and IT practitioners who work on artificial
intelligence systems and who are particularly interested in knowledge representation
for the Internet of Things (e.g., designing ontologies for the Internet of
Things);
• Philosophers who want to make autonomous systems and the Internet of Things
accessible for their theories.


The chapter is structured as follows: After a detailed statement of the research 
problem in Sect. 1.1, we outline in Sects. 1.2 and 1.3 how our research problem is 
embedded in the scientific landscape of philosophy and computer science, respec- 
tively. In Sect. 2, we present a scenario in the Internet of Things context, which is used 
in the following sections to illustrate the concrete influences of theories of meaning 
on Internet of Things applications. Section 3 is dedicated to several semantic theories
originating from philosophy and how they can be used to address our research
problem. The chapter finishes with a summary in Sect. 4.


1.1 Problem Statement and Methodology 


Problem Statement. The Internet of Things (IoT) refers to the idea of the “pervasive
presence of a variety of things or objects around us – such as Radio-Frequency
IDentification (RFID) tags, sensors, actuators, mobile phones, etc. – which, through
unique addressing schemes, are able to interact with each other and cooperate with 
their neighbors to reach common goals” (Atzori et al. 2010). The Internet of Things 
has emerged as an important research topic and paradigm that can greatly affect a 
variety of aspects of everyday life. In the private setting, examples are smart homes, 
assisted living, and e-health. In the business setting, the Internet of Things is used, 
among other things, for automation and industrial manufacturing, logistics, and intel- 
ligent transportation. 

We focus on the connection between the Internet of Things and knowledge repre- 
sentation. As such, we consider intelligent agents—defined as objects acting ratio- 
nally (Russell and Norvig 2010) and often perceived as being identical to smart 
information systems—that 


• have sensors (e.g., cameras, microphones, sensors for temperature and humidity)
that allow them to perceive the environment,
• have actuators (e.g., displays, motors, light bulbs) that allow them to act in an
environment,
• have an interface (e.g., buttons and dials, speech) to communicate with humans,
and
• can act semi-autonomously or autonomously.
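
To make this notion of an agent more tangible, the following sketch models the sensor and actuator parts of such an agent in Python and applies them to the hallway rule from the smart home scenario. It is purely illustrative: the class `Agent`, the sensor and actuator names, and the chosen light level of 40 lux are hypothetical, and the human interface is omitted for brevity.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class Agent:
    """Hypothetical agent with named sensors and actuators."""
    sensors: Dict[str, Callable[[], float]] = field(default_factory=dict)
    actuators: Dict[str, Callable[[float], None]] = field(default_factory=dict)
    autonomous: bool = True

    def perceive(self) -> Dict[str, float]:
        """Read every sensor once and return the current observations."""
        return {name: read() for name, read in self.sensors.items()}

    def act(self, name: str, value: float) -> None:
        """Send a command to one actuator."""
        self.actuators[name](value)


# Hypothetical hallway setup: a presence sensor and a dimmable light.
light_state = {"hallway_light_lux": 0.0}
agent = Agent(
    sensors={"hallway_presence": lambda: 1.0},
    actuators={
        "hallway_light": lambda lux: light_state.update(hallway_light_lux=lux)
    },
)

# Rule from the smart home scenario: switch the light on when somebody
# enters, keeping the light level below 50 lux.
observations = agent.perceive()
if observations["hallway_presence"] > 0.5:
    agent.act("hallway_light", 40.0)
```

Note that identifiers such as `hallway_presence` carry no meaning by themselves. The agent and its users have to agree on what they refer to, which is the kind of common understanding discussed in this chapter.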


In the future, humans and agents will increasingly co-exist side by side. For 
instance, humanoid robots with conversational artificial intelligence capabilities
might become omnipresent. Moreover, agents will communicate with each other 
and thereby exchange knowledge to accomplish tasks in an autonomous way. How- 
ever, obtaining a common understanding of the shared world and having the ability 
to refer to the same objects during communication is from an epistemological point 
of view nontrivial and by no means a matter of course. The crucial aspect in this con- 
text is the gap between the represented world (also called the model) and the actual 
world (see Fig. 1). It is related to mind-body dualism and specifically Descartes’ 
mind-body problem in philosophy (Skirry 2006). Agents have access to the outside 
world (typically called perception of the environment) and are able to trigger changes 
in the world via actuators (i.e., they can change the outside world). This aspect is also 
related to the following questions: How can someone obtain the meaning of a text in
a language unknown to him or her? How can someone interact with people without
the ability to speak their language (see the Chinese room argument (Cole
2014))?


Methodology. We will outline the possibilities of modeling things for scenarios in 
the world of the Internet of Things. Given the Internet of Things, an environment in
which agents are situated with other agents, a theory for knowledge representation 
on the Internet of Things needs to 


1. include sensors (perception) and actuators (action) in its formalization; 

2. include the notion of multiple subjects (machines and humans) that interact with 
the environment and with each other (intersubjectivity); 

3. include ways to describe changes in the represented world that should mirror 
changes in the real world, and vice versa (dynamics). 


Acquiring the correct underlying foundations (in philosophical terms, the
correct conditions of possibility for acquiring and exchanging knowledge) is
crucial to enabling the manifold benefits that arise from increased automation and
human-computer interaction. As an example, let us take one of the prominent sce- 
narios in the specific context of the Internet of Things—the so-called “onboarding” 
of devices. Onboarding is the process of connecting a sensor or a more complex 
Internet of Things device to the Internet and to a platform establishing an initial 
configuration and enabling services (Balestrini et al. 2017; Gupta and van Oorschot 
2019). This process can either be automated or involve broad communities of device
owners. In both cases, the problems of device-platform communication and deciding 
on identifiers (how to address a specific new device) require the acceptance of an 
adequate theory of meaning in the open context system. Such a system interacts with 
the changing world and needs to adapt accordingly. 

This fact has already been noted by prominent philosophers and cognitive
scientists, such as Gärdenfors (2000):


When building robots that are capable of linguistic communication, the constructor must 
decide at an early stage how the robot grasps the meaning of words. A fundamental method- 
ological decision is whether the meanings are determined by the state of the world or whether 
they are based on the robot’s internal model of the world. (Gärdenfors 2000, p. 152) 


Gärdenfors does not describe scenarios involving intelligent agents and does not 
show how the perception layer of a robotic system fits into his model of geometric 
spaces, which is the problem we address in this chapter. Specifically, we focus on 
perception, multiple subjects, and world changes. 


1.2 Existing Solutions in Philosophy 


In philosophy, the study of what knowledge is and how it can be represented (i.e.,
epistemology) and the study of how to acquire knowledge from an environment (i.e.,
philosophy of perception) are highly relevant to addressing the problem of knowledge
representation for the Internet of Things. From these research areas, we can highlight
the following aspects.

[Figure with the labels: Theory; Denotation {True, False}; Approximation {Good, Fair, Poor}]

Fig. 1 Mediated reference theories distinguish between the world and a model of the world. Direct
reference theories, in contrast, do not distinguish between the model and the world (i.e., the model
is the world; illustration adapted from Sowa (2005))


Theories of Meaning. Defining meaning (in the context of language, also referred
to as semantics) has always been an integral part of philosophy. In the
20th century, philosophy shifted its focus to language and the role of language in
understanding. Particularly noteworthy is the groundbreaking work of Gottlob Frege
(1848–1925), which can be seen as the basis for many achievements in the area of
artificial intelligence. Frege’s ideas come together in a mediated reference theory
(see Fig. 1).

Frege challenged the belief that the meaning of a sentence directly depends on the 
meaning of its parts. The meaning of a sentence is its truth value and the meaning 
of its constituent expressions is their reference in the extra-linguistic reality. First, 
he explored the role of proper names (which have direct reference) and concepts
(which gain meaning only when their direct referent is specified). He then studied 
identity statements (in the form of a = a or a = b) and came to the conclusion that 
direct reference theories do not adequately capture the meaning of identity statements. 
In particular, he pointed to the fact that the statements “Hesperus is the same planet 
as Hesperus” and “Hesperus is the same planet as Phosphorus” do not mean the 
same thing, even though the terms “Hesperus” and “Phosphorus” refer to the same 
extra-linguistic entity, the planet Venus. Thus, he came to an important distinction: 
the reference (Bedeutung) of a sentence is its truth value and the sense (Sinn) is the 
thought which it expresses. The questions that originated from Frege’s arguments
gave rise to many theories of meaning in logic and computer science and shaped
the definition of meaning we accept in this chapter. Overall, Frege as a philosopher
provided categories that other scientists questioned and developed.

We define meaning pragmatically as follows:


• The meaning of symbols. Following the idea that meaning is referential (see
Gärdenfors 2000, pp. 151), the meaning of a symbol (including a name) is the object
to which it refers. For real-world entities, one can point directly towards the
objects or mention their proper names. For abstract concepts (classes) and properties,
we take the Cartesian-Kantian two-world assumption as a basis and assume
that, besides the real world we can see (res extensa or phaenomenon), there exists
the world of ideas or thoughts (res cogitans or noumenon), and that classes and
properties exist in this intellectual world.

• The meaning of sentences. Symbols can be arranged together to express statements
(used synonymously with facts; sentences are their written counterpart). Statements
bring us to a new level of meaning, the level of truthfulness. Each sentence can be
true or false.


Theories of Truth. Given that statements can be true or false, questions arise of how
statements stand in relation to the world and how their truthfulness can be tested.
The most commonly used theories of truth are as follows.


• The correspondence theory of truth: This theory can be considered the most basic
theory of truth. True sentences capture the current state of affairs objectively,
without an observer. This theory is very much based on the actual world. It does
not consider how new knowledge is linked to the existing knowledge of a subject,
how the knowledge levels of different people vary, or how these people agree on
the same meanings.

• The coherence theory of truth: This theory is coherence-centric and takes into
consideration how new knowledge is incorporated into existing knowledge. Statements
are considered to be true if they are consistent with the statements (i.e.,
knowledge) obtained so far.

• The consensus theory of truth (pragmatic theory of truth): This theory takes the
different views of observers into consideration and is designed to align subjective
views with other views. Statements are considered true if people (i.e., observers)
agree on them.


It becomes immediately clear that these theories of truth do not exclude each other but 
rather have different foci. We argue that a comprehensive theory needs to take all of 
the theories’ aspects into account. Particularly noteworthy is the fact that the theories 
focus mainly on knowledge and truth at a given point in time (see the construction of 
ontologies in computer science). Dynamic aspects, and thus the modeling of events, 
are insufficiently covered by these foundational theories. 
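
The different foci of the three theories can be made concrete with a minimal sketch. It treats each theory as a separate check on the same statement; the statement encoding, the dictionaries, and all function names are illustrative assumptions, not part of any of the cited theories.

```python
# Hypothetical sketch: three theories of truth as three different checks.
# A statement is encoded as a (subject, property) pair with a Boolean value.

world = {("living_room_light", "on"): True}        # actual state of affairs
knowledge = {("living_room_light", "on"): True}    # what one subject already accepts
opinions = {                                       # what each observer asserts
    "alice": {("living_room_light", "on"): True},
    "bob": {("living_room_light", "on"): True},
    "carol": {("living_room_light", "on"): False},
}


def correspondence_true(statement, value, world):
    """True if the statement matches the state of the world."""
    return world.get(statement) == value


def coherence_true(statement, value, knowledge):
    """True if the statement does not contradict the knowledge obtained so far."""
    return knowledge.get(statement, value) == value


def consensus_true(statement, value, opinions):
    """True if all observers who hold an opinion on the statement agree on it."""
    held = [view[statement] for view in opinions.values() if statement in view]
    return bool(held) and all(v == value for v in held)


light_on = ("living_room_light", "on")
```

In this toy setting, the statement that the living room light is on is true under the correspondence and coherence checks but fails the consensus check, because one observer disagrees. None of the three checks has a notion of time, which mirrors the point about missing dynamics.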


1.3 Existing Solutions in Computer Science and Logic


In computer science and cognitive sciences, specifically the fields of knowledge 
representation and logic, the problem of how to represent knowledge about the world 
for Internet of Things scenarios has been addressed to some degree. 


Theories of Meaning. In the past, computer scientists and logicians defined the 
meaning of objects in their knowledge representation models (e.g., ontologies) and 
methods for describing the world largely without an explicit connection to reality 
and perception. In particular, model theory (Tarski 1944) is the established way of 
defining the meaning of logic-based knowledge representation languages, such as 
the semantic web languages RDF, RDFS, and OWL. 

Moreover, in the area of knowledge representation, it became popular to use 
ontologies (Staab and Studer 2010) and knowledge graphs (Fensel et al. 2020) as 
world models. Freely available open knowledge graphs form the Linked Open Data 
(LOD) cloud, which is used in various applications nowadays (Färber et al. 2018).
However, since logic and model theory are very formal disciplines, there was no 
need to link knowledge representation to perception. Works on ontology evaluation 
and ontology evolution consider the process of creating and evaluating ontologies 
(in the sense used in computer science, i.e., as a formal model of a small domain 
of interest) as finding the lowest common denominator for modeling parts of the 
world. However, researchers mainly discuss common and best practices a team of 
developers can use to create an ontology. Early attempts at defining ontologies
that incorporate temporal dynamics were made by Grenon and Smith (2004) and
Heflin and Hendler (2000).

Overall, existing methods for modeling the world and defining meaning have the 
following drawbacks: (1) They disregard any explicit connection to reality. (2) They 
are omniscient and try to capture an (imposed) objective view of the world. (3) They 
are only able to express static knowledge but not changes in the world to a sufficient 
degree. In Sect. 1.2, we identified similar drawbacks regarding existing theories
of meaning and truth.

If symbols are only identifiers, how can our minds create a link to an object
in the real world (or in our conceptual worlds of ideas or thoughts)? How can we
make sure that other subjects/minds have the same meaning; that is, link to the same
object (e.g., when we only mention the object’s identifier, such as
http://dbpedia.org/resource/Karlsruhe or http://wikidata.org/entity/Q1040)? Is the
meaning directly connected (grounded) to non-symbols? This problem is known as the
symbol grounding problem: “How can you ever get off the symbol/symbol merry-go-round?
How is symbol meaning to be grounded in something other than just more meaningless
symbols?” (Harnad 1990). In the Semantic Web and Linked Data context, URIs are
used as symbols for objects. The symbol grounding problem is rarely considered
there (Cregan 2007), let alone solved. In particular, the aspects of perception,
multiple subjects, and changes in the world (the focus of our chapter) are not
covered sufficiently for knowledge representation. In the Internet of Things domain,
we find only a few works in this respect, such as the article by Hermann et al.
(2017), who present grounded language learning in a 3D environment.


Theories of Truth. Theories of truth are traditionally proposed in philosophy. When
we apply the theories of truth introduced in Sect. 1.2 to the established and widely
used semantic web technologies, such as RDF and OWL, and to knowledge representation
ideas like knowledge graphs and linked open data, we can observe the
following: (1) The RDF data model (Hayes and Patel-Schneider 2014) might fit
the correspondence theory of truth and the consensus theory in the context of the
Internet of Things. (2) Linked data can be regarded as an implementation of the
consensus theory in the sense that data publishers and data consumers need to agree on
common terms to use the linked data in a reasonable way. However, applications in
the Internet of Things require more, since the (linked) data are subject to changes
over time and depend on perception (see, e.g., the sensor data from devices).

In recent years, approaches based on neural networks have been presented that
represent entities and relations in knowledge graphs (as an implementation of a
knowledge representation) in the form of vectors in a low-dimensional vector space
(called embeddings; Mikolov et al. 2013; Wang et al. 2017). Apart from the context
of the entities and relations in the knowledge graph, external data sources have also
been used to build these implicit knowledge representations. For instance, data from
several modalities (text, images, speech, etc.) can be combined to form a unified,
comprehensive representation in a low-dimensional space (Bruni et al. 2014). In the
Internet of Things context, the representations are created based on sensor data, and
thus, perceptions. We can argue that the formal method and technology of obtaining
the sensory data (e.g., images or text of an object) and of transforming them into a
common vector space (e.g., via machine learning techniques) have a direct influence
on the meaning of objects or even constitute the meaning itself.
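
To illustrate how several modalities can end up in one vector space, the following minimal sketch fuses a text vector and an image vector by normalizing and concatenating them, and then compares objects by cosine similarity. Concatenation is only one simple fusion strategy assumed here for illustration; the vectors are made-up toy values, not learned embeddings, and the function names are hypothetical.

```python
import math


def normalize(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def fuse(text_vector, image_vector):
    """Fuse two modalities by concatenating their normalized vectors."""
    return normalize(text_vector) + normalize(image_vector)


def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy (text, image) feature vectors for three objects; values are invented.
lamp = fuse([0.9, 0.1, 0.0], [0.8, 0.2])
bulb = fuse([0.8, 0.2, 0.1], [0.7, 0.3])
door = fuse([0.0, 0.1, 0.9], [0.1, 0.9])
```

With these toy values, `cosine(lamp, bulb)` is larger than `cosine(lamp, door)`. Changing how the sensory data are obtained or transformed changes these similarities, which is the sense in which the method of construction influences the meaning.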


2 Motivating Scenario 


In this section, we describe a smart home scenario, which is used in the upcoming
sections as an example of an Internet of Things scenario. It shows how the theories
of meaning we consider affect the way knowledge is modeled.

Consider the home of Alice (see Fig. 2) with four rooms: the hallway, the living 
room, the bathroom, and the bedroom. Each of the rooms is equipped with a light 
bulb that can be controlled via a network interface. Each room also has a window 
with controllable window blinds. Moreover, each room has a sensor to measure the 
light levels. The door has a sensor that detects when it is opened. A virtual assistant 
called Bob provides a user interface to the smart home via speech interaction. The 
more data and knowledge about the smart home is coupled with the virtual assistant, 
the more generic and flexible the virtual assistant needs to be. 

Considering this scenario, we can point out several issues with respect to knowledge
representation. The first issue concerns naming. Both Alice and Bob have to
agree on the meaning of “the living room,” so that Alice can ask Bob, “Is the light on
in the living room?” Similarly, to effect a change in the world, Alice and Bob have
to agree on names as references to objects, so that Alice is able to tell Bob to “switch
off the light in the living room.” A more elaborate command could be to “switch on
the light in the hallway when somebody enters the home and set the light level in the
hallway below 50 lux.”

Fig. 2 A smart home with various sensors (i.e., luminance sensors and door sensors) and
actuators (i.e., network-controlled lamps and window blinds)

We assume that a shared understanding between the virtual assistant and the 
human user has to be configured when setting up the smart home (the so-called 
“onboarding problem”). The problem also arises when a new human user wants to 
interact with the smart home (e.g., when Carol visits Alice and wants to turn on the 
lights). 

We can think of various other Internet of Things scenarios in which theories of 
meaning (also called semantic theories) become important for modeling the sce- 
narios. For instance, in a Health 2.0 scenario as outlined by Henson et al. (2012), 
the sensor data gathered by Internet of Things devices need to be collected and 
transformed into symbolic information. This transformation allows the system to 
interpret the information and combine it with other existing, symbolically grounded 
knowledge (e.g., about diseases). Questions concerning the representation of per- 
ception, the inter-subjective agreement of concepts and facts, and the representation 
of dynamically changing knowledge arise. 


3 Applying Theories of Meaning to the Internet of Things 


Several theories of meaning have been proposed to link the real world with actual 
knowledge about it. In this section, we review the following semantic theories: 


1. Model-theoretic semantics; 

2. Possible world semantics; 

3. Situation semantics; 

4. Cognitive and distributional semantics. 


These semantic theories have been chosen due to their popularity and their use as
“baselines” in previous work (Gärdenfors 2000, pp. 151). Some authors, such as
Gärdenfors (2000), refer to the first formalism as “extensional semantics” instead
of model theory and to the second as “intensional semantics” instead of modal
logic. Given the various and sometimes incompatible uses of “extensional”
and “intensional” in the literature (Janas and Schwind 1979; Helbig and Glöckner
2007; Lanotte and Merro 2018; Franconi et al. 2013), we use the terms
“model-theoretic semantics” and “possible world semantics” for clarity.

In the following sections, we cover each theory of meaning in detail and apply 
it to the Internet of Things. Within each section, we first give a definition of the 
theory and outline its characteristics. Subsequently, we describe how the theory can 
be applied to model Internet of Things scenarios. We thereby focus primarily on
perception, intersubjectivity, and dynamics, because modeling these aspects is
particularly crucial in the context of the Internet of Things (see Sect. 1).


3.1 Model-Theoretic Semantics for the Internet of Things 


3.1.1 Definition and Current Use 


Model-theoretic semantics can be encoded in various ways. In the following, we 
assume that the knowledge in embodied systems (e.g., a smart home) is described 
using sentences in first-order (predicate) logic. The meaning (i.e., the truth value) of 
the sentences is given via mapping to a world represented using set theory. 

Extensional semantics is considered one of the realistic theories of semantics
(Gärdenfors 2000). Expressions (names) are mapped to objects in the world
(see the correspondence theory of truth). Predicates are mapped to sets of objects or
to relations between objects. Generally, using such a mapping, sentences can be assigned
true/false values (see truth conditions). The “extension” of the sentence “Lassie is
famous” is the logical value “true,” since Lassie is famous. There is no anchoring of
the language in a body (i.e., the meaning of words is modeled independently of
individual subjects). This is known as the human capability of abstraction. The set of
all true sentences constitutes the world.
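
The mapping from expressions to a set-theoretic world can be sketched in a few lines. The sketch below is a toy interpretation restricted to unary predicates; the domain, the names, and the predicate extensions are illustrative assumptions chosen to echo the Lassie example and the smart home scenario.

```python
# Hypothetical set-theoretic model of a tiny world (all names are illustrative).
DOMAIN = {"lassie", "living_room_lamp", "hallway_lamp"}

# Interpretation of names: each constant denotes one object of the domain.
NAMES = {
    "Lassie": "lassie",
    "LivingRoomLamp": "living_room_lamp",
    "HallwayLamp": "hallway_lamp",
}

# Interpretation of unary predicates: each denotes a subset of the domain.
EXTENSIONS = {
    "Famous": {"lassie"},
    "Lamp": {"living_room_lamp", "hallway_lamp"},
    "On": {"living_room_lamp"},
}


def holds(predicate, name):
    """P(c) is true iff the object denoted by c is in the extension of P."""
    return NAMES[name] in EXTENSIONS[predicate]


def exists(predicate):
    """(exists x) P(x) is true iff some object of the domain is in P's extension."""
    return any(obj in EXTENSIONS[predicate] for obj in DOMAIN)


def for_all(antecedent, consequent):
    """(forall x)(A(x) -> B(x)) is true iff every A-object is also a B-object."""
    return all(
        obj in EXTENSIONS[consequent]
        for obj in DOMAIN
        if obj in EXTENSIONS[antecedent]
    )
```

Here `holds("Famous", "Lassie")` evaluates to true purely by set membership. Nothing in the model refers to a sensor, an observer, or a point in time, which anticipates the limitations discussed for the Internet of Things.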

First-order predicate logic provides the foundation for formalizing current Seman- 
tic Web languages, such as RDF, RDFS, and OWL. 


3.1.2 Application to the Internet of Things


While the languages with a formalization in model theory are mature and widely
used, they do not cover the dimensions required in scenarios around the Internet of
Things as outlined in the following:


Perception


The set-theoretic structure representing the world does not have any connection to 
the external world. Whether or not the term “Lassie” refers to Lassie the dog in the 
external world does not have any bearing on the truth value of the sentence. However, 
such a connection is needed to take perception (e.g., sensor data) in Internet of Things 
scenarios into account for modeling the world. 


Intersubjectivity 


The theory does not address the problem of reaching agreement on the meaning of 
terms across different agents. For instance, in the case of the semantic web languages 
RDF, RDFS, and OWL, there exists no defined mechanism that ensures different 
agents have the same notion of terms and sentences. Finding a shared understanding 
is left to the agents. 


Dynamics 


Traditional first-order predicate logic was developed to describe properties of things. 
That is, one can name things (“Lassie”) and assign properties to them (“is famous”). 
The focus of such representations is to deduce new declarative sentences based on 
the given sentences. Some applications use first-order logic to represent events (e.g., 
“Lassie rescues the girl from drowning”), where the event (“rescuing”) is treated as 
a property. While such representations might be suitable for some derivations, they 
do not cover the dynamics behind events sufficiently for scenarios in the Internet of 
Things. 


Benefits and Limitations for the Internet of Things. The focus of model theory is 
to provide a notion of truth of sentences that allows for the specification of logical 
consequence. Logical consequence can help one check for satisfiability of sentences
with regard to the world. It also provides a means to integrate data from multiple sources.
However, model theory does not consider many aspects relevant in the Internet of 
Things, such as the connection of symbols and sentences to the real world or the 
question of how multiple agents can agree on the meaning of symbols. Furthermore, 
model theory lacks means to adequately formalize change, since the sentences are 
classically interpreted over a static model of the world. 


3.2 Possible World Semantics for the Internet of Things 


3.2.1 Definition and Current Use 


The origins of possible world semantics can be traced back to Carnap (1947),
Kripke (1959), and Montague (1974). Without loss of generality, we assume for
the remaining part of the chapter that possible world semantics is implemented
via modal logic. More on the idea of possible worlds as the conceptual underpinning
of modal logics can be found in Hughes et al. (1996) and Menzel (2017). In the
following, we review modal logic and its applicability to Internet of Things
scenarios.

With modal logic, expressions are mapped to a set of possible worlds, instead of 
a single world. Otherwise, the setting is the same as for the extensional semantics 
theory: sentences can have “truth conditions”, and each proposition (sentence) has 
worlds in which it holds true. 

To model these possible worlds, modal logic adds two new unary operators,
$\Box$ (“necessary”) and $\Diamond$ (“possibly”), to the set of Boolean connectives
(negation, disjunction, conjunction, and implication). A proposition is possible if a
world may exist in which this proposition is true. A proposition is necessary if it
has to be true in all worlds.
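
The two operators can be illustrated with a minimal sketch. It assumes the simplified reading given above, in which every world counts (i.e., no accessibility relation between worlds is modeled); the worlds and proposition names are hypothetical and borrowed from the smart home scenario.

```python
# Hypothetical possible worlds: each world assigns truth values to propositions.
WORLDS = {
    "user_in_living_room": {
        "living_room_light_on": True,
        "hallway_light_on": False,
        "home_has_power": True,
    },
    "user_in_hallway": {
        "living_room_light_on": False,
        "hallway_light_on": True,
        "home_has_power": True,
    },
}


def necessarily(proposition, worlds=WORLDS):
    """Box: the proposition is true in all worlds."""
    return all(world.get(proposition, False) for world in worlds.values())


def possibly(proposition, worlds=WORLDS):
    """Diamond: the proposition is true in at least one world."""
    return any(world.get(proposition, False) for world in worlds.values())
```

In this toy model, `possibly("hallway_light_on")` is true while `necessarily("hallway_light_on")` is false, and `necessarily("home_has_power")` is true.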

Depending on the application context, modal operators can have different intuitive
interpretations. For example, if one wants to represent temporal knowledge,
$\Box_{\mathit{future}} P$ may mean that proposition P is always true in the future,
and $\Diamond_{\mathit{future}} P$ that P is sometimes true in the future. These
different ways to interpret modal connectives give rise to various types of modal
logics: tense, epistemic, deontic, dynamic, geometric, and others (see more in
Goldblatt (2006)). Thus, they represent facts that are
“necessarily/possibly” true, true “today/in the future”, “believed/known” to be true,
true “before/after an action”, and true “locally/everywhere.”


3.2.2 Application to the Internet of Things 


We see many possibilities to use modal logic to capture the semantics in Internet 
of Things scenarios. As an example, Fig. 3 shows a system that interprets the voice 
input “Turn on the light” and acts differently depending on the location of the user. 
We can also consider such parameters as time of day and define different scenarios 
with temporal logics. 

Modal logic as a kind of formal logic extends predicate logic by allowing it to 
express possibilities. Modal logic has mainly been used in formal sciences, such as 
logic (e.g., “ontology of possibilities”). However, it has not been applied extensively 
in computer science and, specifically, in Internet of Things contexts. We can observe 
that modal logic as an implementation of possible world semantics is better suited 
to the Internet of Things than model-theoretic semantics. However, modal logic is 
not perfectly suited for modeling knowledge of Internet of Things agents. This can 
be demonstrated by evaluating perception, intersubjectivity, and dynamics. 


Perception 


Similar to first-order predicate logic with a model-theoretic formalization, modal 
logic does not have any connection to the external world. 


Intersubjectivity 


Modal logic and its semantics are still based on a realistic idea (i.e., coordinating
extra-linguistic entities with linguistic expressions). However, subjects’ interpretations
of the world can be represented as distinct worlds. In this way, modal logic allows
us to model multiple worlds and to represent the knowledge of several agents (i.e.,
subjects).

Fig. 3 Possible worlds in the smart home scenario


Dynamics


With the ability to add temporal operators, modal logic allows us to keep
track of the states of resources over time. Furthermore, by keeping track
of state over time, one can detect events (i.e., state changes) and thus represent
knowledge evolving over time.
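
A minimal sketch of this idea treats time points as worlds. It assumes a linear, discrete timeline and a reading of “future” that includes the current time point; the timeline data and function names are hypothetical.

```python
# Hypothetical timeline of smart home states, ordered by discrete time points.
TIMELINE = [
    (0, {"hallway_light": "off", "door": "closed"}),
    (1, {"hallway_light": "off", "door": "open"}),
    (2, {"hallway_light": "on", "door": "open"}),
    (3, {"hallway_light": "on", "door": "closed"}),
]


def always_from(timeline, start, condition):
    """Temporal box: the condition holds at every time point >= start."""
    return all(condition(state) for time, state in timeline if time >= start)


def eventually_from(timeline, start, condition):
    """Temporal diamond: the condition holds at some time point >= start."""
    return any(condition(state) for time, state in timeline if time >= start)


def detect_events(timeline):
    """Report every state change as (time, resource, old value, new value)."""
    events = []
    for (_, before), (time, after) in zip(timeline, timeline[1:]):
        for resource in sorted(after):
            if before.get(resource) != after[resource]:
                events.append((time, resource, before.get(resource), after[resource]))
    return events


def light_is_on(state):
    return state["hallway_light"] == "on"
```

For this timeline, the light is always on from time point 2 onwards but not from time point 0, and `detect_events(TIMELINE)` reports three state changes: the door opening, the light switching on, and the door closing.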


Benefits and Limitations for the Internet of Things. The focus on the logical
consequence of sentences is one of the properties that possible world semantics shares
with model-theoretic semantics. Neither has an explicit connection to the real world.
Modal logic as an implementation of possible world semantics stands out from
implementations of model-theoretic semantics by taking the aspects of intersubjectivity
and dynamics into account. Nevertheless, possible world semantics only provides
means to describe a changing world with sentences and to reason over such sentences,
but not to actually effect changes in the world.


3.3 Situation Semantics for the Internet of Things 


3.3.1 Definition and Current Use 


The theory of situation semantics, another kind of realistic semantics, was developed
by Jon Barwise and John Perry in their seminal book Situations and Attitudes (1983).
In contrast to its predecessor, possible world semantics, it postulates the principle
of partiality of the information available about the world. Limited parts of the world
that are “clearly recognized [...] in common sense and human language” and “can
be comprehended as a whole in [their] own right” (Barwise and Perry 1980) are
called situations. Situations stand in contrast to processes and activities. According
to Galton (2008):


I believe that open processes and closed processes are very different kinds of things. The fact 
that we use the word ‘process’ for both of them perhaps lends some support to Sowa’s use 
of this word as the most inclusive term, corresponding to what others have called situations 
or eventualities. 


Devlin (2006), who formalized the basic notions of situation semantics and
extended them to situation theory, emphasizes that information is always given “about
some situation.” It is constructed from discrete information units, called infons. An
infon ($\sigma$) is a relational structure of the shape
$\langle\langle R, a_1, \ldots, a_n, 0/1 \rangle\rangle$, where $R$ is an
n-place relation, $a_1, \ldots, a_n$ are objects appropriate for the argument roles
$i_1, \ldots, i_n$, and 0/1 are the polarity values indicating whether or not the
objects $a_1, \ldots, a_n$ stand in the relation $R$.

Objects in the argument roles of an infon include individuals, properties, relations, 
space-time locations, situations, and parameters. Parameters in situation semantics 
act as variables (i.e., they reference arbitrary objects of a given type). To set parame- 
ters to concrete real-world entities, Barwise and Perry (1983) introduce an assignment 
mechanism called an anchor. 

Unlike model-theoretic or possible world semantics, situation theory claims that
an infon (roughly corresponding to a fact or statement) can be true (or false) only
in the context of a particular situation. This relationship is written as
$s \models \sigma$ (read as “$s$ supports $\sigma$”), meaning that the fact represented
by infon $\sigma$ holds true in situation $s$.
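
The notions of infon and support can be sketched as a small data structure. The encoding below (a situation as a finite set of infons, support as set membership) is a simplifying assumption for illustration and not Devlin's formal definition; all object names are hypothetical.

```python
from typing import NamedTuple, Tuple


class Infon(NamedTuple):
    """A relation, the objects filling its argument roles, and a polarity."""

    relation: str
    arguments: Tuple[str, ...]
    polarity: int  # 1: the objects stand in the relation, 0: they do not

    def dual(self):
        """The same infon with the opposite polarity."""
        return Infon(self.relation, self.arguments, 1 - self.polarity)


def supports(situation, infon):
    """s supports sigma: here simply membership in the situation's infon set."""
    return infon in situation


light_off = Infon("isOff", ("bulb1", "living_room", "07:00"), 1)
blind_down = Infon("isDown", ("blind1", "living_room", "07:00"), 1)
door_open = Infon("isOpen", ("front_door", "hallway", "07:00"), 1)

# A situation is a limited part of the world, so it carries only some infons.
living_room_morning = frozenset({light_off, blind_down})
```

The situation `living_room_morning` supports `light_off`, but it supports neither `door_open` nor its dual. This is how partiality shows up in the sketch: the situation is silent about the front door.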

Figure 4 shows an illustrative example of situation semantics for the smart home
Internet of Things scenario. In this figure, we can see a limited part ($s$) of the
world where we can distinguish several classes of objects: WindowBlind, Room,
and LightBulb. Potentially, instances of these classes can be involved in many
situations. One of them (i.e., TriggerBlindsUp) is that when it is dark in the room
and already light outside in the morning, the window blinds are automatically raised
by the control system. We represent the relevant relations (isDayTime, tooDark, etc.)
with the following infons, where parameters $l$ and $t$ reference arbitrary spatial and
temporal locations:


($\sigma_{a1}$) $\langle\langle \mathit{isOff}, lb, l, t, 1 \rangle\rangle$, where parameter $lb$ anchors objects of type
LightBulb;

($\sigma_{a2}$) $\langle\langle \mathit{isDayTime}, l, t, 1 \rangle\rangle$;

($\sigma_{a3}$) $\langle\langle \mathit{isDown}, wb, l, t, 1 \rangle\rangle$, with $wb$ anchoring WindowBlind instances;

($\sigma_{i1}$) $\langle\langle \mathit{tooDark}, r, l, t, 1 \rangle\rangle$, with $r$ anchoring Room instances;

($\sigma_{i2}$) $\langle\langle \mathit{blindsUpNeeded}, wb, l, t, 1 \rangle\rangle$, with $wb$ anchoring WindowBlind instances.


Fig. 4 TriggerBlindsUp situation in the smart home scenario


By using conjunction, disjunction, and anchoring, we can combine infons into more
complex structures (i.e., compound infons). For situation TriggerBlindsUp,
the infons form the compound infon:
$s \models \sigma_{a1} \land \sigma_{a2} \land \sigma_{a3} \land \sigma_{i1} \land \sigma_{i2}$.
A system that relies on this formalism can check whether these infons hold in the situation
TriggerBlindsUp and use actuators to trigger the change in the real world.
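
The check described above can be sketched as follows. The tuple encoding of infons, the anchoring dictionary, and the command format returned by `control_step` are assumptions made for illustration; a real system would call its own actuator interface instead.

```python
# Parameterized infons of TriggerBlindsUp: (relation, argument roles, polarity).
TRIGGER_BLINDS_UP = [
    ("isOff", ("lb", "l", "t"), 1),
    ("isDayTime", ("l", "t"), 1),
    ("isDown", ("wb", "l", "t"), 1),
    ("tooDark", ("r", "l", "t"), 1),
    ("blindsUpNeeded", ("wb", "l", "t"), 1),
]


def anchor(infon, anchors):
    """Replace each parameter by the concrete object it is anchored to."""
    relation, roles, polarity = infon
    return (relation, tuple(anchors[role] for role in roles), polarity)


def supports_compound(situation, infons, anchors):
    """A conjunction is supported iff every anchored conjunct is in the situation."""
    return all(anchor(infon, anchors) in situation for infon in infons)


def control_step(situation, anchors):
    """Return the actuator commands to issue (hypothetical command format)."""
    if supports_compound(situation, TRIGGER_BLINDS_UP, anchors):
        return ["raise:" + anchors["wb"]]
    return []


# Hypothetical anchoring of the parameters to concrete real-world entities.
anchors = {
    "lb": "bulb1",
    "wb": "blind1",
    "r": "living_room",
    "l": "alice_home",
    "t": "07:00",
}

# One situation in which all five infons hold, and one without daylight.
morning = {anchor(infon, anchors) for infon in TRIGGER_BLINDS_UP}
night = morning - {("isDayTime", ("alice_home", "07:00"), 1)}
```

Calling `control_step(morning, anchors)` yields a single command to raise `blind1`, whereas `control_step(night, anchors)` yields no command because one conjunct is missing.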

Situation semantics distinguishes three types of situations: utterance situation 
(i.e., the immediate context of utterance, including a speaker and a hearer), focal 
situation (i.e., the part of the world referred to by the utterance), and resource situation 
(i.e., the situation used to support or to reason about focal or utterance situations 
(Devlin 2006)). 

Meaning is acquired by linking utterances expressed in language to objects in the
real world. This link, called the “speaker’s connection” (Barwise and Perry 1983),
determines the unique role of a subject in this theory: It is the agent who establishes
such a link, and meaning is thus made relative to a specific agent. Figure 4
illustrates this possibly changing perspective. The subject perceives the room as dark:
$\langle\langle \mathit{tooDark}, r, l, t, 1 \rangle\rangle$; one can imagine another
subject for whom the polarity of the infon $\sigma_{i1}$ would be 0.

In the area of the Internet of Things, certain information systems employ situation
semantics as the core of their modeling of user behavior and sensor observations,
as well as the basis of context- and situation-awareness (see Heckmann et al. 2005;
Kokar et al. 2009; Stocker et al. 2014, 2016). In the following, we will refer to these
systems to show how situation semantics addresses the problems of perception,
intersubjectivity, and representing dynamics.


3.3.2 Application to the Internet of Things 


In the process of measurement, sensors transform signals of physical properties 
into numbers, thus generating numerical data. These data are challenging to store 
and manage and require near-instant access. The interpretation of the raw values 
requires modeling, finding patterns, and deriving abstractions. Abstractions reveal 
the properties of the observed real-world entities, show their dynamics, and place 
them into relations with their surroundings. 


Perception 


Sensor networks cannot perceive (“observe”) situations directly; instead, as shown 
in Fig. 5, several components are needed to derive decisions and to take actions (see 
Kokar et al. 2009; Stocker et al. 2014). The process can be described as follows: 
The system takes sensor data as input, which then undergo the semantic enrichment
process. Semantically annotated data are then transformed via rule-based inference,
digital signal processing, or machine learning algorithms into higher-level abstractions.
These abstractions can be considered situations, which in turn can trigger
actions and enable intelligent services. Both Stocker et al. (2014) and Kokar et al. 
(2009) exemplify how sensor input is transformed into a set of infons (called observed 
or asserted in Kokar et al. (2009)) and how new inferred infons are derived from 
them. 
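
The processing chain described above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the sensor-to-room mapping, the 50 lux threshold, the single `tooDark` rule, and the command strings are all hypothetical and do not reproduce the systems of Stocker et al. (2014) or Kokar et al. (2009).

```python
SENSOR_ROOM = {"s1": "hallway", "s2": "living_room"}  # assumed deployment metadata
DARK_BELOW_LUX = 50.0                                 # assumed threshold


def enrich(reading):
    """Semantic enrichment: attach property, unit, and location to a raw value."""
    return {
        "property": "illuminance",
        "unit": "lux",
        "room": SENSOR_ROOM[reading["sensor"]],
        "time": reading["time"],
        "value": reading["value"],
    }


def abstract(observation):
    """Rule-based abstraction: turn an annotated value into a tooDark infon."""
    polarity = 1 if observation["value"] < DARK_BELOW_LUX else 0
    return ("tooDark", (observation["room"], observation["time"]), polarity)


def decide(infon):
    """Decision: request light wherever a room is too dark, else do nothing."""
    relation, (room, _time), polarity = infon
    if relation == "tooDark" and polarity == 1:
        return "switch_on_light:" + room
    return None


def pipeline(readings):
    """Measurements -> enrichment -> abstractions -> decisions/actions."""
    infons = [abstract(enrich(reading)) for reading in readings]
    return [action for action in map(decide, infons) if action is not None]


raw = [
    {"sensor": "s1", "value": 12.0, "time": "07:00"},
    {"sensor": "s2", "value": 180.0, "time": "07:00"},
]
```

For the two raw readings, `pipeline(raw)` produces one action for the hallway only. The intermediate infons, rather than the raw values, are what a system built on situation semantics would reason over.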

Situation semantics, therefore, works as a complement to the algorithms that can
directly process data generated in the perception layer. It is a way to organize
sensory input in a task- or goal-oriented environment. In addition, Stocker et al. (2014)
argue that the persistence of situational knowledge in many cases is a desirable
alternative to the persistence of sensor data and the key enabler of useful perceptual data
in real time. Henson et al. (2012) describe an approach for deriving abstractions
(essentially similar to situations) from sensory observations.


[Figure: Measurements → Semantic Enrichment (Annotation) → Abstractions (Signal
Processing/Rules/Machine Learning) → Decisions/Actions]

Fig. 5 Generic components of a system consuming sensor data


Intersubjectivity


In situation semantics, any relation between a real-world situation and its representa- 
tion in a formal framework is relative to a specific subject. An agent recognizes or, in 
the terminology of Barwise and Perry (1983), “individuates” situations. Assigning 
values to certain parameters in the argument roles of an infon is always done by 
a particular subject. Situation semantics has an inherent mechanism to encode the 
subject’s perspective, as well as to represent and to coordinate views of multiple 
subjects. The Internet of Things is often treated as a decentralized distributed sys- 
tem (Singh and Chopra 2017) where different agents generate situational knowledge 
individually. In this context, formalizing situation semantics can ease inter-agent 
communication and data integration (see the discussion in Stocker et al. (2014)). 


Dynamics 


Having situation as its central concept, situation semantics considers both the static
representation of situations (as objects and their relations) and their dynamic aspects.
According to Barwise and Perry (1983), “Events and episodes are situations in 
time ... changes are sequences of situations.” As a consequence, this theory has 
a built-in mechanism for representing temporal and spatial dynamics; namely, it 
introduces special types of objects that can fill argument roles of an infon (i.e., TIM, 
the type of a temporal location, and LOC, the type of a spatial location (Devlin 
2006)). Thus, it is possible to represent whether a relation holds between the objects 
at a particular time in a particular location. 

Stocker et al. (2014) use situation semantics to model observed situations in 
a road traffic scenario. By analyzing the road-pavement vibration data from three 
accelerometer sensors, they were able to detect vehicles in the proximity of sensing
devices (near-relation) and their types (light or heavy). Observations, classified by
the signal processing algorithms and modeled as sets of infons of the shape
$\langle\langle \mathit{near}, \mathit{Vehicle}_x, l_x, t_x, 1 \rangle\rangle$, enabled the inference of the velocity and the driving
side of a vehicle via a custom set of rules. This example shows that this kind of
representation is suitable for time-oriented data. Time-oriented data is a characteristic
of most of the data generated in the Internet of Things (see more in Serpanos and 
Wolf (2017)). 
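
The kind of inference described above can be illustrated with a minimal Python sketch. It is not the rule set of Stocker et al. (2014): the class `Infon`, the function `infer_motion`, the sensor positions, and the timestamps are all hypothetical and only show how two near-infons with location and time arguments suffice to derive a speed and a direction of travel.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Infon:
    """An infon <<relation, vehicle, location, time, polarity>>."""
    relation: str
    vehicle: str
    location: float    # assumed sensor position along the road, in meters
    time: float        # detection time, in seconds
    polarity: int = 1  # 1: the relation holds, 0: it does not hold


def infer_motion(first: Infon, second: Infon) -> Tuple[float, str]:
    """Derive speed (m/s) and travel direction from two positive near-infons."""
    for infon in (first, second):
        if infon.relation != "near" or infon.polarity != 1:
            raise ValueError("expected positive 'near' infons")
    if first.vehicle != second.vehicle:
        raise ValueError("both infons must describe the same vehicle")
    elapsed = second.time - first.time
    if elapsed <= 0:
        raise ValueError("the second infon must be later than the first")
    displacement = second.location - first.location
    direction = "increasing" if displacement >= 0 else "decreasing"
    return abs(displacement) / elapsed, direction


# Hypothetical detections of one vehicle at two sensors that are 30 m apart.
at_sensor_a = Infon("near", "vehicle_1", location=0.0, time=10.0)
at_sensor_b = Infon("near", "vehicle_1", location=30.0, time=12.0)
speed, direction = infer_motion(at_sensor_a, at_sensor_b)
print(speed, direction)
```

The direction label refers to increasing or decreasing sensor positions; mapping it to a driving side would require knowledge about the road layout, which this sketch does not model.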


Benefits and Limitations for the Internet of Things. Barwise and Perry were not 
the first to include situations as first-class citizens into a knowledge representation 
theory (see, e.g., situation calculus McCarthy 1963; McCarthy and Hayes 1969). 
Nevertheless, compared to its predecessors, situation semantics presents a richer 
formalism capable of representing higher level abstractions over raw sensor data, 
multiple viewpoints, and temporal-spatial dynamics. 

Infons with their argument role structures can be reused across related situation
types (e.g., the set of infons of the TriggerBlindUp situation can easily be projected
to TriggerBlindDown). In many Internet of Things scenarios, storing the raw data is not
feasible due to its quantity and the limitations of existing storage solutions. A system
as described in this section allows us to store more meaningful and actionable pieces
of information (situations) for certain signals to act upon in real time.


3.4 Cognitive and Distributional Semantics for the Internet 
of Things 


3.4.1 Definition and Current Use 


Cognitive semantics needs to be considered with respect to the general notion of
cognition: instead of a subject perceiving the world with their senses and using language
to talk about the world, the focus is shifted to the mental representation of the world
(i.e., to the subject’s cognitive structures). Moreover, language becomes part of the
cognitive structure. As such, concepts are elements of the subject’s cognitive structure
without a direct reference to reality. Thus, the meaning of concepts does not go beyond
language; it is nothing other than the use of the language itself (see Ludwig
Wittgenstein’s theory of language) and, therefore, of the cognitive structures. These
cognitive structures are subject to constant adaptation due to the interaction with the
world. For instance, new concepts are learned and new findings are obtained. The world
becomes viable. Overall, cognitive semantics is categorized as a non-realistic theory of
semantics due to the exclusion of reality.

Focusing on the subject’s cognitive structures, the question becomes what these 
cognitive structures look like and how they are created. Motivated by the biology 
of the human brain as the basis for any human’s cognitive ability (Gärdenfors 2000, 
p. 257), neural networks and their mechanisms are typically considered the basis 
for cognition. Inputs, outputs, and internal representations of neural networks are 
modeled mathematically as geometrical (vector) spaces. Vector spaces are therefore 
used to represent things in the world, such as entities, concepts, and relations. Thus, 
knowledge is represented as distributed representations (e.g., embeddings) on a
sub-symbolic level. Meaning is formalized as and reduced to a distance function.
Similar objects tend to be spatially closer to each other in the vector space induced
by the neural network used. Semantics is considered to be distributional (leading to
the term distributional semantics), geometrical, and statistical. 
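
As a minimal illustration of meaning being reduced to a distance function, the following Python sketch looks up the nearest neighbor of an item in a small embedding space. The two-dimensional vectors and the item names are hand-picked assumptions for the smart home scenario; in practice, the vectors would be learned by a neural network and have many more dimensions.

```python
import math


def euclidean_distance(u, v):
    """Distance between two embedding vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def nearest(name, space):
    """Return the item whose embedding is closest to that of `name`."""
    others = (other for other in space if other != name)
    return min(others, key=lambda other: euclidean_distance(space[name], space[other]))


# Hypothetical, hand-picked embeddings (not learned from data).
embeddings = {
    "bulb_1": [0.9, 0.1],
    "bulb_2": [0.8, 0.2],
    "room_1": [0.1, 0.9],
}

print(nearest("bulb_1", embeddings))
```

The two light bulbs lie close together, while the room lies far away from both, so similarity between items is read off the geometry alone.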

Cognitive semantics and distributional semantics are not new phenomena: Harris (1954)
proposed that meaning is a function of distribution (see also the famous dictum,
commonly attributed to Firth, that “a word is characterized by the company it keeps”).
Contemporary philosophers and cognitive scientists use geometrical spaces to explain
cognition and how concepts are formed by subjects. Gärdenfors (2000), for instance,
considered the geometry of cognitive representations. In this cognitive space, points
denote objects, while regions denote concepts (see Fig. 6 and book Chap. 2 for more
information about Gärdenfors’ cognitive framework).
Fig. 6 Low-dimensional vector space representation in the smart home scenario with instances
represented as points, concepts represented as areas, and predicates (relations) represented as vectors 


Artificial neural networks have been used to simulate biological neural networks and,
thereby, cognition. With the revival of research in artificial neural networks in recent
years, research has been performed on how representations for terms, concepts, and
predicates can be learned automatically (see, among others, the approaches TransE and
TransH (Wang et al. 2014)). The idea is to use the weights of the hidden layers of
neural networks as representations (called embeddings). Guha (2015) proposed a model
theory based on embeddings and adapted the Tarski model theory to embeddings.
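
To illustrate how a predicate can be represented as a vector, the following Python sketch scores triples using the translation idea behind TransE: a relation vector should move the head entity close to the tail entity. The vectors, the relation name `located_in`, and the function `transe_score` are hand-picked assumptions; no training is performed here.

```python
import math


def transe_score(head, relation, tail):
    """Distance ||head + relation - tail||; lower values mean more plausible triples."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


# Hypothetical embeddings for the smart home scenario (not learned from data).
entity = {
    "bulb_4": [1.0, 2.0],
    "room_1": [4.0, 6.0],
    "room_2": [0.0, 0.0],
}
located_in = [3.0, 4.0]

plausible = transe_score(entity["bulb_4"], located_in, entity["room_1"])
implausible = transe_score(entity["bulb_4"], located_in, entity["room_2"])
print(plausible, implausible)
```

With these vectors, the triple (bulb_4, located_in, room_1) receives a smaller distance than (bulb_4, located_in, room_2) and would therefore be ranked as more plausible.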

In recent years, knowledge graph entities and relations (i.e., explicit knowledge 
representation formats) have also been embedded, showing that not only expressions 
can be represented in a distributed fashion, but also concepts and entities, as well 
as classes and relations. This allows us to model human cognition in a more natural 
way, because embeddings are learned for specific symbols. 


3.4.2 Application to the Internet of Things 


We assume that cognitive items, such as concepts, are represented in a sub-symbolic
fashion, specifically by means of distributional semantics. Concepts are thus
represented in a geometrical space. We use neural-network-based embedding methods as a
concrete implementation of distributional semantics. Figure 6 shows an example of
representing items for the smart home scenario. Distributional semantics is amenable to
modeling perception, intersubjectivity, and dynamics in the following respects:


Perception 


Distributional semantics differs (with respect to perception) from other semantic 
theories in several ways: 


• The meanings of concepts and facts are represented in a distributed fashion, not
as singular units or symbols. In the smart home scenario, specific light bulbs and
rooms are encoded as embedding vectors (see Fig. 6).

• The representations and, thus, the meanings of concepts are not static, but can be
subject to constant change. To account for this, time-dependent embeddings
can be learned (Nguyen et al. 2018). In the smart home scenario, agents can learn
the embeddings of the different light bulbs and the embedding space of light bulbs
per se based on the sensor data used as input for a neural network.

• Symbol grounding is possible as long as some form of input data (i.e., sensor
data) is provided (and does not change abruptly).

• There are indications that human cognition aggregates the perceptions of different
modalities of one unit (e.g., concept or concrete entity). For instance, the image
of a dog and the sound of a dog are immediately perceived as belonging to the
same unit. The same phenomenon can be observed when a multilingual person
switches between languages whilst referring to the same concepts. With embedding
methods from machine learning, such a fusion of sensor data from different
modalities is likewise possible. In the Internet of Things context, an embedding
vector can be learned jointly based on different modalities.


Overall, perception is reduced to learning embeddings.


Intersubjectivity


Talking about and reaching an agreement on expressions between several agents can 
be traced back to using the same learned representations (i.e., embedding vectors) 
and the same conceptual structures (i.e., the distributional space). Even if different 
initialization values for the embedding spaces are given, subjects can use the same 
learning function to learn the same concepts. In the smart home scenario, the agents 
might differ in the exact points of the individual light bulbs and rooms, since they rely
on their own embedding learning and usage. However, they can agree on the same 
instances and concepts if the embeddings share the same characteristics (e.g., hav- 
ing nearly the same distances to other embeddings in the vector space). Overall, 
learning representations and meaning are reduced to learning and applying the same 
mathematical functions and models. 
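
The agreement criterion sketched above (nearly the same distances to other embeddings) can be made concrete as follows. This Python sketch assumes two-dimensional embeddings and models the second agent's space as a rotated copy of the first; the names `agree`, `rotate`, and the tolerance value are illustrative assumptions, not part of any particular embedding method.

```python
import math
from itertools import combinations


def pairwise_distances(space):
    """Map each unordered pair of item names to the distance of their embeddings."""
    return {
        (a, b): math.dist(space[a], space[b])
        for a, b in combinations(sorted(space), 2)
    }


def rotate(point, angle):
    """Rotate a 2-D point around the origin by `angle` radians."""
    x, y = point
    return (
        x * math.cos(angle) - y * math.sin(angle),
        x * math.sin(angle) + y * math.cos(angle),
    )


def agree(space_a, space_b, tolerance=1e-9):
    """True if both agents place the same items at nearly the same mutual distances."""
    dist_a, dist_b = pairwise_distances(space_a), pairwise_distances(space_b)
    return dist_a.keys() == dist_b.keys() and all(
        abs(dist_a[pair] - dist_b[pair]) <= tolerance for pair in dist_a
    )


# Hypothetical embeddings: agent_2 differs from agent_1 only by a rotation,
# agent_3 places room_1 somewhere else entirely.
agent_1 = {"bulb_1": (0.9, 0.1), "bulb_2": (0.8, 0.2), "room_1": (0.1, 0.9)}
agent_2 = {name: rotate(point, 1.0) for name, point in agent_1.items()}
agent_3 = dict(agent_1, room_1=(0.7, 0.3))

print(agree(agent_1, agent_2), agree(agent_1, agent_3))
```

Although agent_1 and agent_2 assign different coordinates to every item, their spaces have the same distance structure, whereas agent_3 does not share it.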


Dynamics 


Describing changes in the world, such as events, is possible only to a limited extent in
the cognitive theory of semantics. If embeddings as distributed representations are
learned or adapted online (i.e., continuously, not only once at the beginning), then
changes in the world may change the embeddings. However, the change itself is not
represented. In the smart home scenario, an event might be light bulb #4 switching on.
The concepts involved in this event, such as light bulb #4, the positions of the light
bulbs, and room #1, remain the same.


Benefits and Limitations for the Internet of Things. A characteristic of
cognitive/distributional semantics is that information, such as concepts and facts, is
not represented in the form of symbols, but in a sub-symbolic fashion as points and
regions in a vector space. This allows a more continuous distance function and an
agreement on concepts and facts in the world as a continuous process. Talking about
and reaching an agreement on expressions between several subjects can be traced
back to using the same learned representations (i.e., embedding vectors) and the same
conceptual structures (i.e., the distributional space). Thus, distributional semantics
is heavily based on mathematics, which benefits the modeling of data in the Internet
of Things setting. However, describing changes in the world, such as events, is possible
only to a limited extent in the cognitive theory of semantics.


4 Conclusion 


In this chapter, we have considered the theoretical foundations for representing 
knowledge in the Internet of Things context. Based on the peculiarities of the Internet 
of Things, we have outlined three dimensions that must be examined with respect to 
theories of meaning: 


1. Perception: How can a theory of meaning incorporate “direct access” to the world 
(e.g., via sensors)? 

2. Intersubjectivity: How can the world view of several subjects (i.e., agents in the 
Internet of Things) be modeled coherently? 

3. Dynamics: How can the change of knowledge be modeled sufficiently? Which 
aspects of time can be represented? 


We considered the following theories of meaning: 


1. The model theory (extensional semantics)
2. Modal logic (intensional semantics)
3. Situation semantics
4. Cognitive/distributional semantics.


The individual theories have the following advantages and disadvantages (see also
Table 1):


1. Model-theoretic semantics is the simplest model in our series of considered 
semantic theories. This semantic theory can be used to formulate sentences and 
their truth values. However, it does not provide us with techniques or formalisms 
for modeling reality to the highest degree (i.e., with its unstable and experiential 
nature). 


Table 1 Overview of how the challenges of perception, intersubjectivity, and dynamics are met by
the various theories of semantics

                  | Model-theoretic semantics | Possible world semantics | Situation semantics | Conceptual/distributional semantics
Perception        |                           |                          | ✓ (indirectly, via sensor data processed by machine learning) | ✓
Intersubjectivity |                           | ✓                        | ✓                   | ✓ (communicating the dimensions)
Dynamics          |                           | ✓                        | ✓                   |

2. Possible world semantics also does not provide us with an explicit connection to
the real world that we could use to adequately model knowledge in the Internet
of Things domain. However, it allows implementations for modeling temporal
modalities and agents’ beliefs. As such, this semantic theory lays the first theoretical
foundations concerning dynamics and intersubjectivity. However, the foundational
questions of how agents reach an agreement and how this fact about the
agreement can be represented are not covered.

3. Situation semantics can be considered another layer in the pyramid of knowledge 
representation formalisms, as raw sensor data, multiple viewpoints, and temporal- 
spatial dynamics can be represented to some degree. However, we believe that this 
formalism is also not the optimal semantic theory for Internet of Things scenarios, 
as it leaves too many questions unanswered, particularly concerning perception 
and intersubjectivity. 

4. Cognitive and distributional semantics can be judged in a manner similar to
situation semantics when it comes to Internet of Things applications. Compared to
the previous semantic theories, cognitive and distributional semantics are rather
empirical (i.e., data-driven) theories. The introduction of different levels of
cognition and the fact that symbolic knowledge representation can be connected to
sub-symbolic knowledge representation is appealing, particularly when it comes
to data gathered by sensors. Intersubjectivity can be reduced to empirical training
using data and mathematical functions (encoded in the form of [neural] networks).
We see the main shortcoming of this semantic theory in the lack of an (elegant)
modeling of knowledge change over time.


Overall, we conclude that each of the semantic theories helps in modeling specific
aspects, while none sufficiently covers all three aspects simultaneously. For the
future, advancing situation semantics and distributional semantics and combining
them towards a unified semantic theory can be very fruitful for developing future
intelligent information systems.
A Qualitative Similarity Framework
for the Interpretation of Natural
Language Similarity Expressions


Helmar Gust and Carla Umbach 


1 Introduction 


In this paper, a representational framework is presented featuring a qualitative notion 
of similarity. It is aimed at issues of natural language semantics, in particular the 
semantics of expressions of similarity and sameness and their role in comparison and 
ad-hoc kind formation.¹ The starting point was the interpretation of such expressions in
German and English, for example so/such, ähnlich/similar, and gleich/same, which 
all denote similarity in some sense. It would be unsatisfactory, however, to treat 
similarity as a primitive predicate because semantic differences between individual 
similarity expressions would be obscured, for example, the fact that ähnlich/similar 
are gradable while so/such and gleich/same are not (see Umbach and Gust in print). 
Furthermore it would be difficult to establish the connection between similarity 
expressed by scalar and non-scalar equative comparison constructions, as shown 
in (1). 


¹The notion of kinds in linguistics is closely connected to the notion of concepts in psychology
(Carlson 2010). Moreover, ad-hoc categories formed by linguistic expressions show core char- 
acteristics of concepts (Barsalou 1983). We thus assume that kinds formed ad-hoc by similarity 
expressions closely correspond to concepts, see Umbach and Stolterfoht (in prep). 
(1) a. Anna is as tall as Berta.              scalar/adjectival
    b. Anna has a car like Berta’s.           non-scalar/nominal
    c. Anna is dancing just like Berta.       non-scalar/verbal


Finally, a primitive similarity predicate would leave no room to account for the 
observation that certain similarity expressions, in certain contexts, can be used to 
form ad-hoc kinds. German so as well as English such combined with nominal 
expressions may refer to kinds (or concepts) instead of individuals. In (2a, b), for 
example, so ein Fahrzeug/such a vehicle does not refer to a particular vehicle but 
instead to an ad-hoc created kind of vehicles including the set of vehicles similar to 
the one the speaker points to. Umbach and Stolterfoht present experimental evidence
that features licensing ad-hoc kinds must be principally connected to concepts, 
excluding factual and statistical properties (König and Umbach 2018; Umbach and 
Gust 2014; Umbach and Stolterfoht in prep.). Thus, a complex notion of similarity 
not only provides a detailed semantic interpretation of natural language similarity 
expressions—it opens a window into mechanisms of concept formation. 


(2) (Speaker points to an oversized car that makes enormous noise:) 
a. So ein Fahrzeug wird in den Innenstädten bald verboten sein.
b. Such a vehicle will soon be banned in the inner cities. 


The framework in this paper offers a way to spell out the notion of similarity in 
some detail without being forced to leave the well-established ground of referential 
semantics. The core idea is to make use of attribute spaces representing complex 
features of individuals, and to make use of predicates defined on such features deter- 
mining the granularity of representation. In accordance with referential semantics 
we assume that natural language expressions refer to entities, or categories of enti- 
ties, in the real world. However, access is only indirect, mediated by generalized 
measure functions mapping real world entities to points in attribute spaces (this is 
called a mediated reference theory in Färber, Svetashova and Harth, this volume).
Similarity is a key concept in our framework because it provides a variable notion 
of identity/indistinguishability with respect to a representation: Individuals count as 
similar if their features in a particular attribute space, given a particular granularity, 
cannot be distinguished. 

This system provides a powerful and flexible tool in the analysis of natural 
language semantics facilitating detailed interpretations of similarity expressions (so, 
such, similar etc.). Beyond that, and perhaps even more relevant, this system offers the
possibility to analyze linguistic ad-hoc kind formation constructions, for example, 
by so/such demonstratives and equative comparison as in (1) and (2). It is important 
to realize, however, that this system is basically a multidimensional generalization of 
degree semantics (e.g., Kennedy 1999) complemented by a method for varying gran- 
ularity. From this point of view, our framework is anchored in referential semantics 
just as much as degree semantics is. 
Attribute spaces are well-established methods of representation in AI² and also
in some branches of natural language semantics, e.g., in frame-based approaches
(Barsalou 1992; Minsky 1975). What distinguishes attribute spaces and representations
as proposed in this paper from classical frame-based approaches is that we
focus on systems of predicates on points in attribute spaces in contrast to the points
in these spaces themselves, thereby introducing a qualitative aspect, for instance in
modelling comparison. This idea is connected to the idea of micro-theories (see,
e.g., in Cyc³ or other ontology languages) which talk about small parts of the world
covered, e.g., by a single concept like chair, vehicle, elephant, human, etc., but also
about actions and events. We expect that such micro-theories provide some kind of
prototypes or exemplars, positive and also negative ones. Maybe we just imagine
such exemplars. Here is a typical way to introduce the concept of a physical
object in a beginners’ lecture in experimental physics by imagining a positive
example⁴: “Think of a red steel ball of ten centimeters diameter in front of you. It
need not be red, it need not be made of steel, it need not have a diameter of
ten centimeters and it need not be a ball.” This shows that even abstract concepts can
be characterized by exemplars (real or imagined) together with the specification of
relevant dimensions in an attribute space.

This paper is structured in the following way: In Sect. 2 we develop a formal theory
of representation making use of predicate systems over attribute spaces. Section 3
gives a brief overview of the interpretation of natural language similarity
expressions and the role of similarity in ad-hoc kind formation and equative comparison.
Since the focus of this paper is on formal characteristics of the representational
framework, we will not go into linguistic details.⁵ In Sect. 4 we develop a formal
similarity concept based on methods provided in Sect. 2. Section 5 shows how to
use granularity and hierarchies of representations in order to model gradability along
non-scalar dimensions.


2 Representations in Multi-dimensional Attribute Spaces 


We start from the idea that natural language expressions refer to entities or categories
(or even higher order structures, e.g., relations) of entities in the real world, but in an
indirect way. Access to these entities or categories is mediated by a function we call
generalized measure function, e.g., car₁ ↦ {horse_power: 100 ps, weight: 1680 kg,
color: green ...}. This is related to what is called observables in physics⁶: Such a
function assigns observable attributes (elements of an attribute space) to entities or
classes of entities in the world.⁷ The referential power of language predicates like
car (their meaning in the world) can thus be approximated by classifiers. Such
classifiers should be effectively computable characteristic functions of predicates.⁸ They
operate on attribute spaces (or higher order structures based on attribute spaces).⁹
Still, we can go back from predicates on points in attribute spaces to predicates on
the entities in the world via the inverse image of the generalized measure functions.
On the worldly side, a domain includes a set of relevant predicates P talking about
entities in the world. According to the notion of a representation in this paper, these
predicates have counterparts on the representational side marked by a star (*) in
Fig. 1. Counterpart predicates are required to be consistent with their originals; more
precisely, they have to agree in truth value on the set of positive and negative exemplars
of the original predicate. Moreover, counterpart predicates will be assumed to have
convex extensions. As a consequence, they must be true on all points in the convex
closure of the images of the positive exemplars (see Fig. 1 below). In addition, we
stipulate that the extensions of counterpart predicates must be open¹⁰ in some given
topology on attribute spaces. This ensures that small changes in the representation (in
the sense of the given topology) do not change the truth-values of these predicates.


²Starting from Minsky’s frames (Minsky 1975) and feature structures, up to modern approaches
based on description logics (for an overview see https://en.wikipedia.org/wiki/Description_logic).
³For micro-theories in Cyc see, e.g.,
https://pdfs.semanticscholar.org/4f28/6fdf9280449588b9d3781c9c897da28e0cff.pdf.

⁴For an overview of the imagery debate see https://plato.stanford.edu/entries/mental-imagery/.
⁵Readers primarily interested in formal frameworks might skip Sect. 3. Readers primarily interested
in semantics might want to start with Sect. 3 and eventually go back.


⁶There is a long-standing debate about the dichotomy of observables vs. theoretical terms in
philosophy, see https://plato.stanford.edu/entries/theoretical-terms-science/. We take a naive view here:
observables are functions assigning values to entities in the world which can be determined by 
‘simple’ measurements. Examples are temperature, length, width, height, color, position, etc., in 
contrast to values for energy (which in case of heat, for example, depends on temperature, mass 
and specific heat of the matter). 


⁷Our approach is non-constructive since we do not construct representations, but instead have
systems of constraints which representations must obey. Bechberger and Kühnberger (this volume)
discuss approaches for learning feature space representations by multidimensional scaling. They
optimize these representations by using artificial neural networks. From our point of view, they try
to learn a feature space F and a measure function μ from similarity and dissimilarity judgments of
subjects. In this case, μ maps stimuli (elements of a stimuli domain D) to points in F.

Their approach is restricted such that all dimensions of F have a uniform structure. Essentially, F
is a Euclidean vector space in their approach. There is no canonical interpretation of the dimensions
found, and therefore, no link to natural language expressions. In a second step, the goal is to find
classifiers which approximate meaningful subclasses of the stimuli space, which may then lead
to interpretations of the dimensions. Bechberger and Kühnberger discuss this as a quality measure
suited to determining the number of dimensions of F. They generalize the approach to handle unseen
stimuli.
⁸Classification problems are common in artificial intelligence, where classifiers are trained on huge
example sets to be able to classify unseen examples without error. Analogous to our approach, the 
first step is to find a suitable representation of the real world problems which can be handled by 
the classification algorithm. Then the example cases have to be translated into this representation 
in order for the classifier to be able to learn. 
⁹We may want to restrict computational complexity of classifiers since there should be efficient
algorithms for classification. We will pay with accuracy to get easy to classify areas within the 
attribute space. 

¹⁰Open sets are sets without a border. Think of a ball in three-dimensional Euclidean space as
something like a tomato: It has a crisp border. If we remove the border by peeling, it is unclear
where the tomato ends.
Fig. 1 A domain of vehicles and a representation featuring positive and negative exemplars of small
cars 


2.1 Domains and Representations 


We start the formalization of our approach by introducing domains and representa- 
tions. For classifiers, given the truth-value true, we get the extension in the attribute 
space by its inverse image of {true}, and we get its extension in the real world by 
applying the inverse image of the measure function. However, given a language predicate
like small in the context of cars, its reference will in general not be completely
determined by a classifier $\mathit{small}^{*}_{\mathit{car}}$ and by subsequently applying the inverse image of
the measure function. An entity which has all the attributes of a small car may not be a 
small car, and an entity which is a small car may not have all the attributes we in general 
assign to cars. In this sense, classifiers approximate the denotation of language pred- 
icates. This approximation relation is subject to consistency constraints: If we know 
that x is a small car and y is similar enough to x, we expect that y is a small car, too. 
What should ‘similar enough’ mean? In our approach, we can express this in terms of 
the attribute space: The attribute values must be similar enough. 

If the classifiers cannot discriminate between the representations (points in the 
attribute space) of two entities x and y, they must belong to the same concepts: If 
one is a small car, then the other must be a small car, too. In particular, this is the 
case if the representations in the attribute space are equal. Think of a situation where 
we measure size only with very low precision or specify color only by a few color 
values. If the above constraint is violated we should probably change our attribute 
space and/or our measure function, e.g., increase precision of measuring size and/or 
introduce a more fine-grained color specification. 
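
The consistency constraint can be checked mechanically: whenever a positive and a negative exemplar are mapped to the same point, the representation is too coarse. The following Python sketch illustrates this with a single attribute; the car lengths, the entity names, and the two measure functions are hypothetical and only serve to show the effect of increasing the precision of measurement.

```python
from collections import defaultdict


def find_conflicts(measure, positives, negatives):
    """Return attribute-space points shared by a positive and a negative exemplar."""
    labels = defaultdict(set)
    for entity in positives:
        labels[measure(entity)].add(True)
    for entity in negatives:
        labels[measure(entity)].add(False)
    return sorted(point for point, seen in labels.items() if len(seen) == 2)


# Hypothetical car lengths in meters.
length_m = {"city_car": 3.6, "compact": 4.2}


def coarse_measure(entity):
    """Measure length in whole meters only."""
    return round(length_m[entity])


def fine_measure(entity):
    """Measure length with a precision of one decimeter."""
    return round(length_m[entity], 1)


small_car_positive = ["city_car"]
small_car_negative = ["compact"]

print(find_conflicts(coarse_measure, small_car_positive, small_car_negative))
print(find_conflicts(fine_measure, small_car_positive, small_car_negative))
```

With the coarse measure function, both cars are mapped to the same point, so no classifier could separate them; the finer measure function removes the conflict.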

Often we have additional structure on our attribute space, e.g., a (pre)order relation.
Assume that x and y are small cars, and z is in the car domain. The numbers of wheels
are $w_x$, $w_y$, $w_z$, respectively; x, y, and z differ only in the number of wheels. Then, if
$w_x < w_z < w_y$, we expect z to be a small car, too. If not, we again have an inconsistency
in our representation. And again, we probably should change it. The mathematical
foundation of this type of consistency constraint is the theory of convex closures. The
formal definition of a convex closure operator cl on a set X is the following (see Korte
et al. 1991):

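A function $\mathrm{cl}\colon \wp(X) \to \wp(X)$ is a convex closure operator iff


• it preserves the empty set: $\mathrm{cl}(\emptyset) = \emptyset$

• it is extensive: $A \subseteq \mathrm{cl}(A)$ for all $A \subseteq X$

• it is monotone: $A \subseteq B \Rightarrow \mathrm{cl}(A) \subseteq \mathrm{cl}(B)$

• it is idempotent: $\mathrm{cl}(\mathrm{cl}(A)) = \mathrm{cl}(A)$

• the anti-exchange property holds: for $x, y \notin X$, $x \neq y$,
  $x \in \mathrm{cl}(X \cup \{y\}) \Rightarrow y \notin \mathrm{cl}(X \cup \{x\})$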


In the two-dimensional Euclidean plane, we can visualize the effect of a convex
closure operator. Suppose $X$ is $\mathrm{cl}(\{a, b, c\})$. If $x$ is in $\mathrm{cl}(X \cup \{y\})$, then $y$ cannot be
in $\mathrm{cl}(X \cup \{x\})$. The anti-exchange property ensures convexity. In a two-dimensional
Euclidean plane, this means that for any two points in $X$ the connecting line must
also be in $X$ (Fig. 2).

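On a (partially) ordered set $(M, \leq)$ we can define convex closure operators in a
natural way (see Fig. 3). For $A \subseteq M$ we define:


left closure: $\mathrm{cl}_{\leq}(A) = \{x \in M \mid \exists y \in A\colon x \leq y\}$
right closure: $\mathrm{cl}_{\geq}(A) = \{x \in M \mid \exists y \in A\colon y \leq x\}$
convex closure: $\mathrm{cl}(A) = \{x \in M \mid \exists y, z \in A\colon y \leq x \leq z\}$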
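
For a finite carrier, the three closures can be computed directly from their definitions. The following Python sketch assumes a reflexive order relation passed as a function `leq`; the function names and the two example orders (the usual order on numbers of wheels and the divisibility order) are illustrative choices.

```python
import operator


def left_closure(subset, carrier, leq=operator.le):
    """All x in the carrier that lie below some element of the subset."""
    return {x for x in carrier if any(leq(x, y) for y in subset)}


def right_closure(subset, carrier, leq=operator.le):
    """All x in the carrier that lie above some element of the subset."""
    return {x for x in carrier if any(leq(y, x) for y in subset)}


def convex_closure(subset, carrier, leq=operator.le):
    """All x with y <= x <= z for some y, z in the subset."""
    return left_closure(subset, carrier, leq) & right_closure(subset, carrier, leq)


def divides(a, b):
    """Divisibility, a partial (not total) order on the positive integers."""
    return b % a == 0


# Linear order: numbers of wheels between two known small cars.
wheels = set(range(1, 9))
print(sorted(convex_closure({3, 6}, wheels)))

# Partial order: elements between 2 and 12 with respect to divisibility.
numbers = set(range(1, 13))
print(sorted(convex_closure({2, 12}, numbers, divides)))
```

Computing the convex closure as the intersection of the left and the right closure is equivalent to the definition, since the witnesses y and z can be chosen independently of each other.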


To sum up: We approximate the meaning of natural language predicates by classifiers
and their inverse images by means of a generalized measure function. Additionally,
we require that classifiers respect some consistency constraints: (i) they should
classify known examples correctly, (ii) their extension (as a subset of the attribute
space) should be convex according to a suitable convex closure operator, and (iii)
their extensions should be open in a suitable topology. The topology and the closure
operator must be compatible: Closures of open sets must be open.


Fig. 2 Convex closure and anti-exchange property in the Euclidean plane

Fig. 3 A non-convex set in the two-dimensional plane and its convex closure


First, we need a notation to refer to the entities we are talking about by a natural 
language predicate like small car: the set of entities (in the world) for which it 
makes sense to ask if they have car properties, that is, entities for which the attribute 
dimensions for cars make sense, e.g., number of wheels, horsepower, size, weight, 
color etc. We exclude entities for which it does not make sense to ask if they have 
car properties, e.g., single atoms, trees, hens etc. 

Next, we assume that we have clear cases: positive examples such as entities which 
are definitely cars, and negative examples such as entities for which the attribute 
dimensions of cars make sense but which are definitely not cars, e.g., motorbikes. 
Concepts which are related and belong to the same micro-theory are collected as 
predicates over the same domain. Think of different types of cars, bikes, trikes etc. 

We assume that there is a universe U which includes all the entities in the world. 
We can now start formalizing our approach by defining a domain as a subset of the
universe U together with a set of predicates and non-overlapping sets of positive and 
negative examples for each predicate. 


Definition 1 Domain
A domain $\mathcal{D}$ is a quadruple $(D, \_^{+}, \_^{-}, P)$ with:


• $D \subseteq U$ a set of individuals/entities (called the carrier of the domain),

• $P = \{p_1, \ldots, p_n\}$ a set of identifiers of predicates over $D$,

• $\__{D}\colon P \to \wp(D)$ the extension in $D$ of a predicate¹¹ denoted by index $D$¹²

• $\_^{+}\colon P \to \wp(D)$ a function which assigns a set of positive examples to each
predicate (for $\_^{+}(p)$ we write $p^{+}$),

• $\_^{-}\colon P \to \wp(U)$ a function which assigns a set of negative examples to each
predicate¹³ (for $\_^{-}(p)$ we write $p^{-}$),

• $\forall p \in P\colon p_D^{+} \cap p_D^{-} = \emptyset$ (consistency),
e 3q €e PYp €P: ppt C qt App. © qp* ^qp` N D = G (universal predicate). 
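As an illustration, Definition 1 might be encoded for a finite universe as follows. This is a minimal sketch: the class name `Domain`, its field names and the toy vehicle data are assumptions made for the example, the extension function is left out, and only two conditions are checked (positive examples lie in the carrier; positive and negative examples of a predicate do not overlap).

```python
from dataclasses import dataclass


@dataclass
class Domain:
    """Finite stand-in for a domain <D, _+, _-, P> (hypothetical encoding)."""
    carrier: set    # D, a subset of the universe U
    positive: dict  # predicate name -> set of positive examples (inside D)
    negative: dict  # predicate name -> set of negative examples (anywhere in U)

    def is_consistent(self) -> bool:
        for p in self.positive:
            pos = self.positive[p]
            neg = self.negative.get(p, set())
            if not pos <= self.carrier:  # positive examples must be in D
                return False
            if pos & neg:                # consistency: p+ and p- are disjoint
                return False
        return True


# Toy data (assumed): "oak" lies outside the carrier but may still be a negative example.
vehicles = Domain(
    carrier={"beetle", "mini", "vespa", "trike"},
    positive={"car": {"beetle", "mini"}, "motorbike": {"vespa"}},
    negative={"car": {"vespa", "oak"}, "motorbike": {"beetle"}},
)
broken = Domain(
    carrier={"beetle"},
    positive={"car": {"beetle"}},
    negative={"car": {"beetle"}},
)
print(vehicles.is_consistent(), broken.is_consistent())
```

The second domain fails because the same entity is listed as both a positive and a negative example of one predicate.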


11 In fact, we will often use characteristic functions in place of predicates. In the structures we are
interested in, there is an isomorphism between $\wp(D)$ and $\Omega^{D}$. We will not restrict ourselves to a
special type of logic (e.g. two-valued classical logic). We stipulate a logical system characterized
by a set of truth-values $\Omega$: $\Omega = \{\text{true}, \text{false}\}$ for classical logic, $\Omega = [0, 1]$ for fuzzy logic.

12 We will drop the index $D$ whenever it is clear which domain we are talking about.


13 Positive examples must be in the domain, negative examples may be anywhere. A small mouse
is a negative example for ‘big elephant’, but a small elephant is a more informative example.

2.2 Representations and Classifier Systems


We view the elements of D as entities to which we have only indirect access via a
(generalized) measure function $\mu$. The measure function $\mu$ constructs representations
of the entities in D as points in an attribute space F, much like observables in physics.
Attribute spaces are well-established representational structures.¹⁴ They generalize
vector space approaches in allowing heterogeneous dimensions equipped with value
sets of different scales (nominal, ordinal, interval, proportional, partially ordered
etc.), where value sets may themselves be attribute spaces with multiple dimensions.

An attribute space F is given by a set of attributes $A = \{a_1, \ldots, a_n\}$, such that for
each $a_i$ in $A$ there is a set of possible values $V_{a_i}$ of $a_i$. Elements of D are mapped to
points in $V_{a_1} \times \cdots \times V_{a_n}$, the carrier of the attribute space F. Think, for example,
of number of wheels as an attribute with {1, 2, 3, 4, 5, 6, ...} as its value set, or
horsepower as an attribute with the positive real numbers as its value set.¹⁵

A representation includes an attribute space F, a (generalized) measure function
$\mu$ mapping elements of a domain into the attribute space, and a set of classification
functions $p^{*}$ applying to points in the attribute space. In the case of the attribute
number of wheels the measure function $\mu$ just has to count. In the case of the attribute
horsepower a complex measurement procedure is required to determine the value
of $\mu$. The classification functions (classifiers for short) serve as approximations¹⁶ of
the predicates in P.¹⁷ Moreover, the extensions of the classifiers will be assumed
to be open and convex. This means that F comes with a convex closure operator cl
and $p^{*}$ must be true on $cl(\mu(p^{+}))$.¹⁸ Using the n-dimensional Euclidean space as an
example, the extensions of the classifiers must not have holes, notches or coves in
the representation space F.


Definition 2 Representation

A representation $\mathcal{F} = \langle \langle F, cl \rangle, \mu, \_^{*}, \mathcal{D} \rangle$ of a domain $\mathcal{D} = \langle D, \_^{+}, \_^{-}, P \rangle$ is given
by

• an attribute space F together with a closure operator cl and a compatible topology
  (we write F for $\langle F, cl \rangle$ if we are not interested in the closure operator cl),
• a measure function¹⁹ $\mu: D \to F$,


14 Attribute spaces are related to the classical frame approaches (Minsky 1975). Other related
approaches are feature structures, which are widely used in linguistic formalisms (Carpenter 1992).


15 Note that ordinal or metric dimensions as common in degree semantics correspond to
one-dimensional attribute spaces in our approach.


16 More precisely: $p^{*} \circ \mu$ approximates $p$.
17 For every $p \in P$ there is a $p^{*} \in P^{*}$.


18 This includes all points in the convex closure of the images of the positive exemplars. For the
concept of convexity in conceptual structures see Gärdenfors (2000). Intuitively, the convex closure
of a subset X of F is the smallest convex subset of F containing X.


19 In most cases, we do not expect to explicitly compute values of the measure function for entities
in D. Almost no one will be able to compute the horsepower of his car. To learn about the
horsepower of my car I would look up the value in the data sheet. When you go to the doctor for a general

Fig. 4 Domains and representations


• a function $\_^{*}: P \to \Omega^{F}$
  (we write $p^{*}$ for $\_^{*}(p)$ and call them classifiers).²⁰


Representations are subject to three consistency constraints: 


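• $\forall p \in P$: the extension of $p^{*}$ must be open and convex in $\langle F, cl \rangle$,
• $\forall p \in P\ \forall x \in p^{+}$: $p^{*}(\mu(x)) = \text{true}$,
• $\forall p \in P\ \forall x \in p^{-} \cap D$: $p^{*}(\mu(x)) = \text{false}$.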


From this we get $\mu(p^{+}) \cap \mu(p^{-} \cap D) = \emptyset$ (Fig. 4).
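The last two consistency constraints can be checked mechanically on a finite example. The sketch below assumes a one-dimensional attribute space (number of wheels) and classifiers given as open intervals, which are open and convex on the real line by construction; the function name `violations` and all data are hypothetical.

```python
def violations(mu, classifiers, positive, negative, carrier):
    """List (predicate, entity) pairs that break the last two constraints of Definition 2."""
    found = []
    for p, (lo, hi) in classifiers.items():
        def p_star(value, lo=lo, hi=hi):
            return lo < value < hi          # open interval: open and convex on the real line
        for x in positive.get(p, set()):
            if not p_star(mu[x]):           # positive examples must be classified true
                found.append((p, x))
        for x in negative.get(p, set()) & carrier:
            if p_star(mu[x]):               # negative examples inside D must be classified false
                found.append((p, x))
    return found


carrier = {"beetle", "mini", "vespa", "trike"}
wheels = {"beetle": 4, "mini": 4, "vespa": 2, "trike": 3}   # measure function mu (assumed data)
positive = {"car": {"beetle", "mini"}}
negative = {"car": {"vespa", "trike", "oak"}}               # "oak" is outside D and is ignored

good = violations(wheels, {"car": (3.5, 4.5)}, positive, negative, carrier)
bad = violations(wheels, {"car": (2.5, 4.5)}, positive, negative, carrier)
print(good, bad)
```

With the wider interval the three-wheeled negative example falls inside the classifier's extension, so the representation is no longer consistent with the domain.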


As mentioned above, attribute spaces are familiar methods of representation. What
distinguishes attribute spaces from the representations proposed in this paper is the
idea of classifiers on attribute spaces. On the worldly side, a domain includes a set of
relevant predicates $p \in P$. On the representational side, these predicates have
counterparts, namely classifiers $p^{*} \in P^{*}$. By $P^{*}$ we denote the set of all basic classifiers:
$P^{*} = \{p^{*} \mid p \in P\}$. These classification functions are required to be consistent with their
corresponding predicates over D; more precisely, for the set of positive/negative
exemplars the truth-values of the classification functions have to agree with the
truth-values of the original predicates (see Definition 2).

Given a set of basic classifiers,²¹ we assume the possibility to construct derived
classifiers by logical operations: For the logical conjunction this is unproblematic


health check-up the chance that she will take a measuring stick to measure your height is very small.
It might instead be like this: doctor: “How tall are you?”, patient: “As tall as you.”, doctor: “About
1.75?”, patient: “Think so.” Nevertheless, it should at least in principle be possible to determine
the value for a given element in D. It is even possible to use machine learning techniques to learn
suitable dimensions and values by analyzing similarity judgments of subjects (see footnote 7).
20 Where $\Omega^{F}$ is the set of characteristic functions $F \to \Omega$. In addition, we expect that classification
functions come with algorithmic methods to compute these functions.

21 There is an interaction between the attribute space F and the measure function $\mu$. While attribute
spaces can provide highly structured representations, classifiers can be viewed as attributes with
values in $\Omega$. It is possible to hide all the complex structure of a representation in the measure function

(convex sets and open sets are closed under intersection). For the logical disjunction
we have to apply the convex closure operator cl to the result. For negation this is not
possible. Thus we do not allow complex classifiers to be defined by applying negation to
elementary ones.²² We name the set of derived classifiers $\overline{P^{*}}$.


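Definition 3 Classifier systems
Given a set of basic classifiers $B$ over an attribute space F, we define a set of
classifiers $\overline{B}$ inductively (much like a topology):

• $B \subseteq \overline{B}$ (we expect that elements of $B$ are convex and open),
• $X \in \overline{B}, Y \in \overline{B} \to X \cap Y \in \overline{B}$ (intersections),
• $X \in \overline{B}, Y \in \overline{B} \to cl(X \cup Y) \in \overline{B}$ (closures of unions).

If F is (partially) ordered:

• $X \in \overline{B} \to cl_{\rightarrow}(X) \in \overline{B}$, with $cl_{\rightarrow}(X) = \{x \in F \mid \exists y \in X: y \leq x\}$ (right closures),
• $X \in \overline{B} \to cl_{\leftarrow}(X) \in \overline{B}$, with $cl_{\leftarrow}(X) = \{x \in F \mid \exists y \in X: x \leq y\}$ (left closures).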


It is important to mention that in general $\overline{B}$ is not closed under complement. This
means that we do not have negation: Complements of convex sets need not be convex
and complements of open sets need not be open. We start with basic classifiers
$B = P^{*} = \{p_1^{*}, \ldots, p_n^{*}\}$ and get $\overline{P^{*}}$ as the corresponding system of classifiers.
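The inductive construction in Definition 3 can be tried out in the simplest ordered attribute space, the real line. The sketch below is an illustration under assumptions: classifiers are represented by their extensions as open intervals `(lo, hi)`, the convex closure of a union is taken to be the interval hull, and the function name `derive` is hypothetical.

```python
import math
from itertools import combinations


def derive(basic, ordered=True):
    """Close a set of open intervals (lo, hi) under the operations of Definition 3."""
    derived = set(basic)
    while True:
        new = set()
        for (a, b), (c, d) in combinations(derived, 2):
            lo, hi = max(a, c), min(b, d)
            if lo < hi:
                new.add((lo, hi))                  # intersection (skipped when empty)
            new.add((min(a, c), max(b, d)))        # cl(X u Y): hull of the union
        if ordered:
            for a, b in derived:
                new.add((a, math.inf))             # right closure
                new.add((-math.inf, b))            # left closure
        if new <= derived:                         # fixpoint reached
            return derived
        derived |= new


height_classifiers = {(1.78, 1.82), (1.806, 1.814)}
system = derive(height_classifiers)
print(sorted(system))
```

Every element of the resulting system is a single open interval. The complement of (1.78, 1.82) would consist of two separate rays, so it cannot appear in the system, which mirrors the absence of negation.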


3 Similarity Expressions in Natural Language 


This section gives a brief overview of the challenges involved in the
interpretation of similarity expressions. It does not give a full description
of the semantic phenomena (references are given for details) but instead serves
as a motivation for the specifics of the similarity framework presented in this paper.


3.1 Similarity Demonstratives 


The need for a framework that models similarity originated from the problem of how 
to interpret the German demonstrative so (‘so’/‘such’). It is a genuine demonstrative 


by using $(p_1^{*} \times \cdots \times p_n^{*}) \circ \mu$ as new measure function and $\Omega^{n}$ as attribute space F. Of course that
is not the idea of this approach. We will try to use ‘simple’ measure functions and meaningful
attribute dimensions.

22 In general, complements of concepts are not necessarily themselves concepts: a non-car is not a
proper concept.

expression, so we expect direct reference in the sense of Kaplan (1989). It does not,
however, express identity as does, e.g., dies/this, and instead it refers to a set of 
entities which are in some sense similar to the target of the demonstration gesture 
(the entity the speaker points to). If the speaker points to a car while uttering “So ein 
Auto hat Anna” (‘Anna has a car like this’), Anna’s car is said to be, with respect to 
a particular set of features, indistinguishable from the car the speaker points to.
Demonstrative expressions of this kind are called similarity demonstratives in Umbach and
Gust (2014) and Gust and Umbach (2015), and demonstratives of manner, quality and
degree in König and Umbach (2018).

We follow Nunberg’s (1993, 2004) adaptation of the Kaplanian analysis, inter- 
preting demonstratives as directly referential expressions, but at the same time 
dismissing the idea that the target of the demonstration is necessarily identical to 
the referent of the demonstrative. This allows for a straightforward interpretation of 
similarity demonstratives such that the target of the demonstration is the individual 
or event the speaker points to, and the referent of the demonstrative phrase is related 
to the target by similarity instead of identity. Similarity is then implemented by 
indistinguishability of points in attribute spaces (see Sect. 4). This implementation 
of similarity is in fact close to the idea of contextual granularization suggested in 
Nunberg (2004): When restricting attention to a particular set of features, it may 
be the case that two entities can no longer be distinguished. It is important to note, 
however, that this idea requires a framework that distinguishes between a referential 
and a representational level—you cannot speak about indistinguishability without 
access to what could have been distinguished. 


3.2 Ad-Hoc Kinds 


According to the similarity analysis, demonstratives like German so and English 
such create classes of similar items, e.g. similar cars. There is some evidence that 
in the nominal and verbal case (though not in the adjectival case) these similarity 
classes constitute ad-hoc kinds. In a nutshell, so/such phrases can be shown to be
restricted to particular features of comparison. For example, the feature number of
doors would be perfect when comparing cars but not when comparing mugs: mugs
do not have doors, so the number of doors does not qualify as a feature of comparison
for mugs. But mugs as well as cars can be recently purchased, and nevertheless being
recently purchased does not qualify as a feature of comparison for either cars or
mugs. This suggests that properties qualifying as features of comparison must not
be accidental.

There is experimental evidence that features of comparison are restricted to
properties which are neither accidental nor evaluative (see König and Umbach 2018;
Umbach and Stolterfoht in prep.). This raises the question of how to characterize
these properties, which is a prominent issue in the debate about concept formation
in cognitive psychology. Only recently has this debate been connected to the topic
of genericity in linguistics by Greenberg (2003) and Carlson (2010), and by the
experimental studies in Prasada and Dillingham (2006) and Prasada et al. (2013),
providing evidence that there are so-called principled connections between kinds
and properties, that is, properties that an entity has because it is the kind of thing it is.

There is an alternative analysis claiming that demonstratives like German so and 
English such are pro-kind expressions (see Anderson and Morzycki 2015, adapting 
Carlson’s 1980 kind-referring analysis of such). The final results of the two accounts 
are fairly close. However, unlike the pro-kind account, the similarity account does not just
postulate that so/such phrases denote kinds, but in addition shows how these kinds
emerge, namely by similarity.


3.3  Equative Comparison 


Another phenomenon where similarity plays a significant role is equative
comparison, including non-scalar as well as scalar cases, see (3a-c).²³ In German,
scalar as well as non-scalar equatives are uniformly constructed by so ... wie where
so is a correlative pronoun relating to the standard of comparison given in the wie
clause:


(3) a. Anna ist so groß wie Berta              scalar/adjectival
       Anna is as tall as Berta
    b. Anna hat so ein Auto wie Berta          non-scalar/nominal
       Anna has a car like Berta’s
    c. Anna tanzt so wie Berta                 non-scalar/verbal
       Anna is dancing just like Berta


Given that the demonstrative so can in general be substituted by wie dies (‘like this’),
it is natural to analyze wie as expressing similarity as does so, though without
a deictic component. This allows for a generalized account of equative comparison:
The nominal equative in (3b) is interpreted such that Anna’s car is similar to Berta’s 
car with respect to a set of contextually given features; the verbal case in (3c) is 
interpreted such that the event of Anna dancing is similar to the event of Berta 


23 It has been argued that (3a) and (3b, c) just differ in being one-dimensional as opposed to
multi-dimensional, and that even multi-dimensional comparison is scalar. There are, in fact,
multi-dimensional adjectives like healthy that allow for comparatives: A is more healthy than B. Sassoon
(2013) suggests interpreting comparatives of multi-dimensional adjectives by quantification over
dimensions in which the compared entities exceed the standard: A is more healthy than B iff the
number of dimensions in which A exceeds the standard is greater than that of B exceeding the
standard (for alternatives see the subsection on gradability below).

This approach presupposes, however, that the individual dimensions are scalar, which is not 
generally the case, consider, e.g., color as a dimension in comparing cars or posture as a dimension 
in comparing dancing habits. Moreover, even though cars and dancing habits can be compared in 
equatives, forming comparatives is impossible. This is strong evidence that (3b,c) are genuinely
non-scalar.

dancing; and the adjectival case in (3a) is interpreted such that Anna is similar to
Berta with respect to their height—note that the scalar equative in (3a) does not hinge 
on contextually given features of comparison but instead ‘carries its dimension on 
its sleeves’. 


3.4 ‘Exactly’ Versus ‘At-Least’ Reading 


Scalar equatives like (3a) allow for two readings. On the exactly reading, Anna’s 
height is (approximately) the same as Berta’s height, while on the at-least reading 
Anna’s height is greater than or equal to Berta’s height. While both readings are 
attested in the data, standard degree semantics and the similarity analysis differ with 
respect to which reading is predicted to be primary. In standard degree semantics 
equatives are assumed to have an at-least interpretation as their meaning while the 
exactly reading is derived by scalar implicature. In the similarity analysis, on the 
other hand, equatives (scalar as well as non-scalar) are interpreted such that their 
meaning is symmetric, since similarity is an equivalence relation—A ist so groß wie 
B means that A is similar in height to B—thereby raising the question of how to 
account for the at-least reading. 

The question of which of the exactly and the at-least reading is basic has been the 
topic of a continuous debate when addressing numeral expressions. According to the 
classic analysis by Horn (1972), sentences containing numbers assert lower bound- 
edness and may, depending on the context, implicate upper boundedness—Anna has 
three sheep asserts that she has at least three sheep and implicates, depending on 
context, that she has at most three sheep. This analysis has been questioned, for 
example, by Kennedy (2013) who presents, among other things, scope effects that 
cannot be explained in the classic analysis. Surprisingly, this debate has not been 
extended to equative constructions, even though according to the classic analysis
degree equatives assert at-least interpretations, as in the case of Horn’s analysis of
numerals: Anna is as tall as Berta is true if $\mathrm{height}(\text{Anna}) \geq \mathrm{height}(\text{Berta})$ (see, e.g.,
Kennedy 1999).

We assume that the semantics of scalar equatives is given by similarity even in 
contexts requiring an at-least reading, and we implement this idea by exploiting the 
granularity encoded in our framework. Consider the example in (4). In this context, 
Sophie tells the truth even if she is taller than Larissa. In general, if there is a threshold 
given in the context, it appears irrelevant by how much it is exceeded. 


(4) Sophie wants to join the police, which requires a certain minimum height. Her 
cousin Larissa has told their grandma that she has already been accepted by 
the police. That’s why grandma asks Sophie whether she is as tall as Larissa. 
Sophie replies: Ja, ich bin so groß wie Larissa/Yes, I’m as tall as Larissa.

In the case of at-least readings, classifiers applying to the standard of comparison,
e.g., Larissa’s height in (4), are mapped to their right closure.²⁴ Thereby Sophie
counts as similar in height to Larissa even if she is ten centimeters taller. Thus
our account is “mildly ambiguous”: in particular contexts, closures involved in
determining similarity are adjusted. It has to be noted, though, that this adjustment is
licit only if the difference is moderate. But if, for example, Larissa is a six-year-old 
and Sophie is her mother, it would be absurd to assert that Sophie is as tall as Larissa 
(which is predicted to be true on the classical analysis of degree equatives). 
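The effect of mapping a classifier to its right closure can be sketched on a single height dimension. The interval around Larissa's height and the numeric values are assumptions chosen for illustration, and the requirement that the difference be moderate is not modelled here.

```python
import math


def right_closure(interval):
    """cl_->(X) for an open interval X = (lo, hi): everything above the lower bound."""
    lo, _hi = interval
    return (lo, math.inf)


def holds(classifier, height):
    lo, hi = classifier
    return lo < height < hi


larissa_granule = (1.63, 1.67)   # assumed classifier around Larissa's height (metres)
sophie = 1.75

exactly = holds(larissa_granule, sophie)
at_least = holds(right_closure(larissa_granule), sophie)
print(exactly, at_least)
```

Under the original classifier Sophie's height is distinguishable from Larissa's; after right closure it is not, which corresponds to the at-least reading.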

For negated scalar equatives the prominent reading is asymmetrical: The sentence 
Anna ist nicht so groß wie Berta/Anna is not as tall as Berta is preferably interpreted
such that Anna is smaller than Berta. This asymmetry is not influenced by the exis- 
tence of a contextual threshold and does not appear infelicitous in the case of major 
differences—Larissa is not as tall as Sophie would be acceptable even if Sophie is 
Larissa’s mother. The preference for the asymmetric reading of negated scalar equa- 
tives can be explained by the fact that a disjunctive (symmetric) reading according 
to which Anna is either smaller or taller than Berta would not be convex any longer. 
Given that convexity plays a primary role in cognitive economy it is hardly surprising 
to find such effects in natural language semantics (see also Solt and Waldon 2019 on 
numerals under negation). 


3.5 Gradability 


Implementing similarity as indistinguishability (see the next section) suggests that 
it is a nongradable concept. This is plausible considering expressions like German 
so/wie and English such/like. On the other hand, the adjectives ähnlich and similar 
are gradable—Anna can be more similar to her father than to her mother. This points 
to the need for a gradable notion of similarity. 

Cognitive Science models of similarity usually start out either from a notion of 
distance in a geometrical space (e.g. Gärdenfors 2000) or from numbers of common 
and distinctive features (e.g. Tversky 1977). Both approaches facilitate a
straightforward definition of the comparative: In geometric models similarity increases if
distance decreases, and in feature-based models similarity increases if the number
of common features increases and that of distinctive features decreases. However,
the positive form—the predicate similar—would require a threshold from where on 
two items count as similar, which would be hard to provide in a non ad-hoc fashion. 

In our system, the positive form is the primary one—two items are similar if 
indistinguishable with respect to a given representation (including dimensions of 
comparison and classifiers, see Definitions 2, 4 and 5). The comparative will be 
defined making use of representations of different granularity: Two items a and b 
are more similar than two items c and d in a representation F if and only if there is 


24 See the quasi-exactly implementation of the at-least reading by right closure of classifiers in
Sect. 4.

a less granular representation F’ such that a and b are similar in F’ while c and d
are not (see Definition 8 in Sect. 5). Suppose, for example, that in representation F 
neither a and b nor c and d are similar. If there is a less granular representation F’ 
such that a and b are similar while c and d can still be distinguished, then a and b 
must be closer in terms of properties than c and d. 
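The idea of deriving more similar from representations of different granularity can be illustrated as follows. This is only a sketch: Definition 8 is not reproduced here, and the code assumes that a less granular representation is obtained by dropping some classifiers, each classifier being a threshold test on a single height dimension. The function names and numbers are hypothetical.

```python
from itertools import combinations


def similar(a, b, thresholds):
    """a and b are indiscernible if no threshold classifier 'value > t' separates them."""
    return all((a > t) == (b > t) for t in thresholds)


def more_similar(pair1, pair2, thresholds):
    """True if some coarser representation keeps pair1 similar while separating pair2."""
    for size in range(len(thresholds)):            # proper subsets only: strictly coarser
        for coarser in combinations(thresholds, size):
            if similar(*pair1, coarser) and not similar(*pair2, coarser):
                return True
    return False


fine = (1.60, 1.71, 1.80)          # hypothetical thresholds of representation F
ab, cd = (1.70, 1.72), (1.70, 1.90)

print(similar(*ab, fine), similar(*cd, fine))
print(more_similar(ab, cd, fine), more_similar(cd, ab, fine))
```

In the fine representation neither pair is similar. Dropping the threshold 1.71 makes a and b indiscernible while c and d stay apart, and no coarsening does the reverse.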

Defining a comparative notion more similar based on the positive form similar is 
reminiscent of the vague-predicate approach suggested by Klein (1980). In contrast 
to the standard degree-semantic approach where degrees are compared in interpreting 
the comparative—Anna is taller than Berta is true if her degree of height exceeds that 
of Berta—in a Kleinian approach the comparative is modelled by varying contexts, 
that is, varying thresholds for the positive predicate to apply: Anna is taller than Berta 
is true if there is a context such that Anna counts as tall while Berta does not.²⁵ This
way of interpreting the comparative is, first of all, consistent with cross-linguistic 
findings showing that the majority of languages express the comparative in terms of 
the positive. Moreover, it does not rely on the existence of a single scale of degrees. 

The definition of more similar suggested above gives us the means to interpret 
the comparative form of the adjective similar. But beyond that it allows a Kleinian 
style definition of comparatives for multi-dimensional adjectives like healthy and 
beautiful. Comparatives of multi-dimensional adjectives are usually interpreted using 
degree semantics, either by counting dimensions in which the threshold is exceeded 
(see Sassoon 2013), or by integrating dimensions such that the result forms an order, 
where integration may be context-dependent and also judge-dependent (see Solt 
2016). 

The similarity framework puts us in the comfortable position of not having to 
treat all adjectives in the same way. Adjectives like tall and old, which clearly refer 
to a single ordinal or even metric scale, will be interpreted via a single dimension. In 
this case, similarity takes the role of specifying the granularity of this scale: Anna is 
taller than Berta is true if all points of the granule of Anna’s height are greater than 
all points of the granule of Berta’s height (in the case of overlapping the situation 
is more complex). Multi-dimensional adjectives like healthy and beautiful, on the 
other hand, will be interpreted by similarity to a prototype²⁶: Anna is healthy is true
if Anna’s health is similar to the prototype. And Anna is more healthy than Berta is 
true if Anna’s health is more similar to the prototype than Berta’s health. 


4 Indiscernibility


In order to realize that two entities in the world are different their representations 
must differ in some way. This means that they must be recognizably different. In 
our approach this means that there are classifiers which can discriminate them. The 


25 Contexts have to be consistent with the order of individuals in the domain.
26 Analogous to thresholds in a single dimension: context-dependent and maybe judge-dependent.

complementary situation is indistinguishability, which means that, on the
representational level, we cannot discriminate them. In our approach, given a system of
predicates P there are two reasons why we may not be able to distinguish two elements
of D:


• Two elements may lead to the same value of the function $\mu$, i.e., the same point in
  the attribute space. Then no classifier can discriminate between the two elements.

• The two elements disagree on $\mu$ (so we see that they are different), but they agree
  on all classifiers in $P^{*}$.


To account for these types of indistinguishability we borrow the term indiscernible 
from Rough Set Theory (Pawlak 1998): 


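Definition 4 Indiscernible
Given a representation $\mathcal{F} = \langle F, \mu, \_^{*}, \langle D, \_^{+}, \_^{-}, P \rangle \rangle$ we define:
For $x, y \in F$: $x \sim_{\mathcal{F}} y$ iff $\forall q \in \overline{P^{*}}: q(x) \leftrightarrow q(y)$


where $\overline{P^{*}}$ is the set of all derived classifiers.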


According to this definition, indiscernibility is relative to the classifiers in P* in 
a representation F. The relation of indiscernibility talks about points in F. However, 
the similarity relation we are interested in talks about elements of the domain D. 
Therefore, we have to apply the measure function before checking indiscernibility.
This gives us a first simple similarity relation:


Definition 5 Similar
$\forall x, y \in D$: $sim(x, y, \mathcal{F})$ iff $\mu(x) \sim_{\mathcal{F}} \mu(y)$


Obviously, Definition 5 defines an equivalence relation on D and we get a partition 
of the domain. The indiscernibility relation provides attribute spaces with a level of 
granularity, facilitating comparison of attribute spaces of distinct granularity which 
are otherwise identical. Let $[y]$ denote the equivalence class (similarity class) of $y$:
$[y] = \{x \mid x \sim_{\mathcal{F}} y\}$. In Rough Set theory, such equivalence classes are called granules.
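The partition induced by Definitions 4 and 5 can be computed directly on a finite domain. In the sketch below the heights, the three classifiers and the function name `granules` are assumptions; the list `classifiers` is taken to contain every classifier quantified over, so derived classifiers would have to be added to it explicitly.

```python
from collections import defaultdict


def granules(mu, classifiers):
    """Partition entities by their truth-value signature over all classifiers."""
    classes = defaultdict(set)
    for entity, point in mu.items():
        signature = tuple(q(point) for q in classifiers)
        classes[signature].add(entity)
    return list(classes.values())


heights = {"anna": 1.79, "berta": 1.82, "carl": 1.70, "dora": 1.72, "eva": 1.70}
classifiers = [
    lambda h: h < 1.65,          # short* (open ray)
    lambda h: 1.60 < h < 1.80,   # medium* (open interval)
    lambda h: h > 1.75,          # tall* (open ray)
]
partition = granules(heights, classifiers)
print(sorted(sorted(c) for c in partition))
```

Carl and Eva are indiscernible because they are mapped to the same point, while Dora is mapped to a different point but agrees with them on every classifier. These are the two sources of indistinguishability listed above.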

There is a problem with this definition of similarity: The similarity classes in the 
attribute space may not be convex, as the following example shows. Think of case
(3a) Anna ist so groß wie Berta (‘Anna is as tall as Berta.’). Assume that we have a
dimension of height (measured in meters) in the attribute space and classifiers which
specify height with some granularity depending on the measured value: A height of
1.80 is given by some value between 1.78 and 1.82, while a height of 1.81 is given
by some value between 1.806 and 1.814, and so on. Therefore, we may not be able
to discriminate between 1.80 and 1.815: both belong to the same granule [1.80].
Nevertheless, we can discriminate between 1.80 and 1.81 since we have a classifier
[1.81] giving true on 1.81 and false on 1.80. Therefore, the granule of Berta’s height
([y] in Fig. 5, which is equal to [1.80]) may not be convex because [1.81] forms a
hole. This results in the following situation: If Berta’s height is 1.80, then Anna’s
height may be 1.80 or 1.815 but not 1.81 in order for the sentence to be true (as 
demonstrated in Fig. 5). This is counterintuitive.

Fig. 5 Granules with holes (axis: HEIGHT; marked granule: [y] = [1.80])
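The hole can be reproduced with the two classifiers from the example. The sketch uses the interval bounds given in the text and treats the measure function as the identity on heights; the function names are hypothetical.

```python
def c180(h):
    return 1.78 < h < 1.82      # classifier [1.80]


def c181(h):
    return 1.806 < h < 1.814    # classifier [1.81]


def indiscernible(x, y, classifiers):
    return all(q(x) == q(y) for q in classifiers)


classifiers = [c180, c181]
same_as_180 = [h for h in (1.80, 1.81, 1.815) if indiscernible(1.80, h, classifiers)]
print(same_as_180)
```

The value 1.81 drops out of the class of 1.80 although it lies between two values that belong to it, so the class is not convex.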


We can solve this problem by introducing a new parameter in the definition of the 
similarity relation: similarity relative to a point of reference. This point of reference 
determines the granules to be selected. 

Definition 6 Similarity relative to a point of reference

Given a representation $\mathcal{F} = \langle \langle F, cl \rangle, \mu, \_^{*}, \langle D, \_^{+}, \_^{-}, P \rangle \rangle$, we can define a
similarity relation relative to a point of reference r in two different ways:

$\forall x, y \in F$: $x \sim_{\mathcal{F},r} y$


(a) iff $\forall q \in P^{*}: q(r) \to q(x) \wedge q(y)$
(b) iff $\forall q \in P^{*}: q(r) \to (q(x) \leftrightarrow q(y))$


Definition 6a means that the principal filters²⁷ of x and y in $P^{*}$ contain the principal
filter of r. In contrast, Definition 6b means that elements of the principal filter of r
in $P^{*}$ cannot discriminate between x and y. It is easy to see that (a) $\Rightarrow$ (b), but not
(b) $\Rightarrow$ (a).

For an intuitive insight into the functionality of this type of similarity relation, 
have a look at the Venn diagrams in Fig. 6 and at Table 1: 

Assume that there are four classifiers in P*: small*, big*, normal* (concerning
size), and heavy* (concerning weight). Table 1 shows some possible classifications 
of x, y, and r. These possibilities correspond to the dashed sets in Fig. 6. The last two 
columns show the truth-values of the two similarity relations (a) and (b) in Definition 
6 for the different cases. All the other cases can be handled by symmetry; only heavy* 
varies. The interesting case is line (2) since the two similarity relations differ: If y is 
small but r and x are not, and x is big but r and y are not, and x and y are normal but r 
is not, and r is heavy but x and y are not, then similarity of x and y with respect to the 
reference point r is true according to Definition 6b but false according to Definition 
6a. Intuitively, if the properties of the reference point r differ substantially from the 
properties of x and y then Definition 6a gives false while 6b gives true. We consider 
Definition 6a more plausible than 6b. 
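The difference between the two variants of Definition 6 can be checked on line (2) as described in the text. In this sketch the points r, x and y are plain labels and each classifier is given by its three truth values; the helper `from_table` and the variable names are hypothetical.

```python
from itertools import product


def sim_a(r, x, y, classifiers):
    """Definition 6a: every classifier true of r is true of both x and y."""
    return all(q(x) and q(y) for q in classifiers if q(r))


def sim_b(r, x, y, classifiers):
    """Definition 6b: no classifier true of r separates x from y."""
    return all(q(x) == q(y) for q in classifiers if q(r))


def from_table(r, x, y):
    values = {"r": r, "x": x, "y": y}
    return lambda point: values[point]


# Line (2) of Table 1, as described in the text.
line2 = [
    from_table(False, False, True),   # small*
    from_table(False, True, False),   # big*
    from_table(False, True, True),    # normal*
    from_table(True, False, False),   # heavy*
]
result_a = sim_a("r", "x", "y", line2)
result_b = sim_b("r", "x", "y", line2)

# (a) implies (b) for every truth assignment of a single classifier.
a_implies_b = all(
    (not sim_a("r", "x", "y", [from_table(*bits)])) or sim_b("r", "x", "y", [from_table(*bits)])
    for bits in product([False, True], repeat=3)
)
print(result_a, result_b, a_implies_b)
```

Since both definitions are conjunctions over the classifiers, checking the implication for a single classifier over all eight truth assignments is enough to confirm that (a) entails (b).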

For given $\mathcal{F}$ and r the relation $x \sim_{\mathcal{F},r} y$ is a (kind of local) equivalence relation. If
we switch the reference r, the classes will obviously change. If we choose one of the
arguments as point of reference, we get an asymmetric similarity relation: In general
$x \sim_{\mathcal{F},y} y$ will be different from $y \sim_{\mathcal{F},x} x$ because the point of reference changes.


27 The principal filter of x is $\{q \in P^{*} \mid q(x)\}$.

Table 1 Similarity of two points x and y in the attribute space with respect to a reference point
r depending on the possible extensions of the predicates small*, big*, normal*, and heavy*. The
cell [small*, (1)], for example, indicates that small*(r) is false, small*(x) is false and small*(y) is
true. The cell [heavy*, (1)] indicates that heavy*(r) is false, heavy*(x) is either true or false, and
heavy*(y) is also either true or false


(f = false, t = true, t/f = either; blank cells: values not preserved)

       small*     big*       normal*    heavy*           $x \sim_{\mathcal{F},r} y$
       r  x  y    r  x  y    r  x  y    r  x    y        (a)     (b)
(1)    f  f  t    f  t  f    f  t  t    f  t/f  t/f      true    true
(2)    f  f  t    f  t  f    f  t  t    t  f    f        false   true
(3)    f  f  t    f  t  f    f  t  t                     false   false
(4)    f  f  t    f  t  f    f  t  t                     false   false
(5)    f  f  t    f  t  f    f  t  t                     true    true

Fig. 6 If dashed sets occur in $P^{*}$, x and y cannot be similar. If $x \sim_{r} y$ holds, classifiers of the
dashed sets cannot occur in $P^{*}$.
Panel (a): $\forall q \in P^{*}: q(r) \to q(x) \wedge q(y)$, i.e. $\neg\exists q \in P^{*}: q(r) \wedge (\neg q(x) \vee \neg q(y))$.
Panel (b): $\forall q \in P^{*}: q(r) \to (q(x) \leftrightarrow q(y))$, i.e. $\neg\exists q \in P^{*}: q(r) \wedge \neg(q(x) \leftrightarrow q(y))$.


Definition 7 Similarity classes
For given $\mathcal{F}$ and r we define the similarity class of r as


(a) $[r]_{\sim} = \{x \mid \forall q \in P^{*}: q(r) \to q(x)\}$.


For $[r]_{\sim}$ we borrow the term granule from Rough Set theory. Again we can use the
inverse image of the measure function to define similarity relations on the domain.
For $a, b \in D$ we define two different similarity relations. The one in (b) makes use
of a point of reference r that is independent of either a or b, whereas in (c) the point
of reference is identical to the second argument:


(b) $sim_{r}(a, b, \mathcal{F})$ iff $\mu(a) \sim_{\mathcal{F},r} \mu(b)$ (+transitive, +symmetric, −reflexive)
(c) $sim'(a, b, \mathcal{F})$ iff $\mu(a) \sim_{\mathcal{F},\mu(b)} \mu(b)$ (−transitive, −symmetric, +reflexive)²⁸


If we again look at our example (3a) Anna ist so groß wie Berta (‘Anna is as tall
as Berta.’) we see that the granules depend on the point of reference r (Fig. 7). If we
use sim′ from Definition (7c), there are two possible situations. In the first situation,
we get the information that the height of Berta is 1.80. Since Berta provides the
reference point (Definition 7c) the relevant granule is [1.80]. The height of Anna can


28 sim′ uses the second argument as point of reference.

Fig. 7 The effect of holes (axis: HEIGHT, marked values 1.79 and 1.80; regions labelled ‘Anna is as tall as Berta.’ true and false)


be an arbitrary value in this granule to make the statement true. It may be 1.80 or
1.81; we simply cannot discriminate between the two cases because the granule [1.80]
is convex (no holes). In the second situation, we get the information that the height
of Berta is 1.81. Now the relevant granule is [1.81] and not [1.80] even though 1.81 
may be an element of [1.80]. The height of Anna is restricted to the relevant granule: 
1.80 is not a possible value any longer, it falsifies the statement. Although it seems 
that there is a hole in [1.80] in the second case, in both cases, the relevant granule is 
convex. 


4.1 (A)symmetry of Similarity 


The notion of similarity relative to a reference point is reminiscent of the question of 
whether the predicate similar is symmetrical addressed by Tversky (1977) and also 
Gleitman et al. (1996). 

Tversky’s seminal paper on feature-based similarity starts with empirical obser- 
vations indicating problems of the then predominant geometric notion of similarity 
and the basic axioms of metric distance²⁹: (i) minimality is problematic in view of
results concerning the identification probability for identical stimuli, (ii) symmetry 
is apparently false—the judged similarity of North Korea to Red China exceeds 
the judged similarity of Red China to North Korea—and (iii) triangle inequality is 
hardly compelling—Jamaica is similar to Cuba (geographical proximity) and Cuba 
is similar to Russia (political affinity) but Jamaica and Russia are not similar at all. 

However, a closer look reveals that these findings are not generally valid. 
Before dismissing transitivity of the similarity relation on the basis of the 
Jamaica/Cuba/Russia example, one should consider the role of switching features 
within the two comparison steps.³⁰ And before dismissing symmetry, which is


29 A metric distance function δ has to comply with (i) minimality: δ(a, b) ≥ δ(a, a) = 0, (ii) symmetry:
δ(a, b) = δ(b, a) and (iii) triangle inequality: δ(a, b) + δ(b, c) ≥ δ(a, c).
30 sim′ (Definition 7c) is in fact intransitive due to using the second argument as point of reference.
frequently done in the Cognitive Science literature, one should consider the study in
Gleitman et al. (1996) and, first of all, Tversky’s original study. 

In Tversky’s study, the linguistic presentation was directional (North Korea is 
similar to Red China), and he himself argues that the asymmetry finding hinges on 
the directional way of presentation. If the task is to assess the degree to which A is 
similar to B, then features of A may weigh more heavily than those of B.³¹,³² But if
the task is to assess the degree to which A and B are similar to each other, weights 
are expected to be equal and similarity judgements are symmetric. In Gleitman et al. 
(1996) the influence of directional vs. nondirectional presentation is experimentally 
examined for a number of predicates that are intuitively thought to be symmetrical 
including similar, equal and identical. The authors find that the way of presentation 
is decisive for the (a)symmetry in the interpretation of these predicates, even if the 
nouns they are combined with are nonsense nouns. 

Tversky as well as Gleitman et al. attribute the asymmetry effects triggered by 
directional presentation to the difference between Figure and Ground. The same 
idea is found in our second definition of relative similarity (Definition 7c), where the 
second argument takes the role of the Ground in determining the relevant granule. 
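The effect of directional weighting can be sketched in Python using the general form of Tversky's contrast model, S(a, b) = θf(A ∩ B) − αf(A − B) − βf(B − A). The feature sets, the weights and the choice of set cardinality for f are illustrative assumptions, and the function name `contrast_similarity` is hypothetical.

```python
from typing import Callable, Set


def contrast_similarity(
    A: Set[str],
    B: Set[str],
    theta: float = 1.0,
    alpha: float = 0.8,
    beta: float = 0.2,
    f: Callable[[Set[str]], float] = len,
) -> float:
    """Contrast model: common features count in favour, distinctive features against.

    alpha weights the distinctive features of the first argument (the subject),
    beta those of the second argument (the referent).
    """
    return theta * f(A & B) - alpha * f(A - B) - beta * f(B - A)


# Hypothetical feature sets: a variant with few features, a prototype with many.
variant = {"f1", "f2", "f3"}
prototype = {"f1", "f2", "f3", "f4", "f5", "f6"}

# Directional presentation (alpha > beta): the two directions differ.
variant_to_prototype = contrast_similarity(variant, prototype)
prototype_to_variant = contrast_similarity(prototype, variant)
print(variant_to_prototype > prototype_to_variant)

# Nondirectional presentation (alpha == beta): the two directions coincide.
symmetric_1 = contrast_similarity(variant, prototype, alpha=0.5, beta=0.5)
symmetric_2 = contrast_similarity(prototype, variant, alpha=0.5, beta=0.5)
print(symmetric_1 == symmetric_2)
```

The sketch only shows that asymmetry in this model comes from unequal weights, not from the feature sets as such; it makes no claim about which weights are empirically adequate.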


4.2 ‘Exactly’ Reading Versus ‘At-Least’ Reading


As shown in Sect. 3, scalar equatives may have two readings, an exactly reading and
an at-least reading: Anna is as tall as Berta may be interpreted such that Anna’s
height is the same as Berta’s height or such that Anna’s height is at least Berta’s height.
We assume that the semantics of scalar equatives is uniformly given by similarity even 
in contexts requiring an at-least reading, and we implement this idea by exploiting 
the granularity provided by closures on classifier systems. 

The exactly reading of equatives is accounted for by the granules defined by 
the available classifiers and the reference point μ(Berta): μ(Anna) must be in the
granule of μ(Berta). To account for the at-least reading we need a transformation of
classifiers such that all degrees above a certain point x count as similar.³³ Formally,
we define a mapping from the classifier set $P^*$ to a classifier set $P_r^*$ such that every $p^*$
in $P^*$ that classifies a member of $cl_{\rightarrow}([r]_{\sim_F})$ as true is mapped to its right closure
while the others stay unchanged. Figure 8 shows such a mapping: All classifiers left 


31 In Tversky’s contrast model a function S takes weighted sums of the feature sets A and B of objects
a and b to an interval scale such that sim(a, b) < sim(c, d) iff S(a, b) < S(c, d), where
$S(a, b) = \theta f(A \cap B) - \alpha f(A - B) - \beta f(B - A)$; α, β, θ denote weighting functions and f denotes a nonnegative
scale.


32 There is also the issue of which features are activated in the first place. In a directional presentation 
the subject will determine which features are relevant in comparison. 

33 If we have a simple interval scale, we can model the at-least reading directly by the order of the
attribute values. If we want to model granularity in addition, it becomes more complex since granules 
may overlap. If the scale is weaker or multiple dimensions are involved, comparison becomes even 
more complex. Our approach provides a uniform framework for all these cases. 

[Figure: HEIGHT scale showing μ(Berta) and μ(Anna)]


Fig. 8 Quasi-exactly implementation, one dimension 


of [r] stay unchanged, while all classifiers to the right of [r] will be mapped to [r]. 
If the classifier extensions overlap, the situation may be quite complex. The right 
closure of [r] handles the general case. This procedure makes it possible to derive 
the at-least reading from the exactly reading by solely adapting classifiers. We call 
it a quasi-exactly implementation of the at-least reading: 


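Quasi-exactly implementation of the at-least reading by right closure of classifiers:
$P_r^* = \{p_r^* \mid \text{for } p^* \in P^*: \text{if } p^* \cap cl_{\rightarrow}([r]_{\sim_F}) \neq \emptyset \text{ then } p_r^* = cl_{\rightarrow}(cl(p^* \cup [r]_{\sim_F})) \text{ else } p_r^* = p^*\}$.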
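For the one-dimensional case, the transformation can be sketched in Python as follows. The sketch assumes a finite height scale in centimetres, classifier extensions given as plain sets, convex closure as the interval hull, and right closure as "everything from the minimum upwards"; the names `quasi_exactly`, `SCALE` and `BINS` are hypothetical.

```python
from typing import Dict, Iterable, Set


def convex_closure(s: Set[int], scale: Iterable[int]) -> Set[int]:
    """cl: all scale points between the minimum and the maximum of s."""
    lo, hi = min(s), max(s)
    return {x for x in scale if lo <= x <= hi}


def right_closure(s: Set[int], scale: Iterable[int]) -> Set[int]:
    """Right closure: all scale points at or above the minimum of s."""
    lo = min(s)
    return {x for x in scale if x >= lo}


def granule(r: int, classifiers: Dict[str, Set[int]], scale: Iterable[int]) -> Set[int]:
    """[r]: intersection of all classifier extensions that contain r."""
    g = set(scale)
    for ext in classifiers.values():
        if r in ext:
            g &= ext
    return g


def quasi_exactly(classifiers: Dict[str, Set[int]], r: int, scale: Iterable[int]) -> Dict[str, Set[int]]:
    """Map every classifier that meets the right closure of [r] to a right-closed set."""
    g = granule(r, classifiers, scale)
    upper = right_closure(g, scale)
    return {
        name: right_closure(convex_closure(ext | g, scale), scale) if ext & upper else set(ext)
        for name, ext in classifiers.items()
    }


# Hypothetical non-overlapping classifiers on a height scale in cm.
SCALE = range(170, 191)
BINS = {
    "b1": set(range(170, 175)),
    "b2": set(range(175, 180)),
    "b3": set(range(180, 185)),
    "b4": set(range(185, 191)),
}

berta, anna = 181, 188
transformed = quasi_exactly(BINS, berta, SCALE)
print(anna in granule(berta, BINS, SCALE))         # exactly reading
print(anna in granule(berta, transformed, SCALE))  # at-least reading
```

In this toy setting the same membership test ("is μ(Anna) in the granule of μ(Berta)?") is used for both readings; only the classifier system changes.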


Although we get an at-least reading, the result still defines an equivalence class³⁴: If
we select a granule by a point of reference, every element in the granule is equivalent 
to every other element in the granule. This approach can handle multi-dimensional 
cases, too. Assume that we are talking about the size of tables represented by dimen- 
sions length and width, and we use the classical convex closure of the Euclidean 
two-dimensional space. For non-overlapping classifiers the following two situations 
may occur (Fig. 9a, b). If the extension of a classifier $p^*$ is outside $cl_{\rightarrow}([r]_{\sim_F})$, then $p^*$
stays unchanged. If it is inside, then $p^*$ will be mapped to $cl_{\rightarrow}([r]_{\sim_F})$, analogous to the
one-dimensional case. The general case with overlapping classifiers is again covered 
by the formulas in Fig. 9a, b. 

It is essential in our approach that the exactly interpretation is the primary one 
and is specified by the granularity given by the (contextually determined) classifier 
system P*. The at-least interpretation is derived by applying a transformation to the 
classifier system P* depending on the reference element r. 


5 Granularity of Representations and Gradability 
of Similarity 


As stated in Sect. 3, granularity of representations provides a notion of more similar 
serving in the interpretation of the comparative form of the adjective similar. More 
importantly, the notion of more similar is exploited in the interpretation of multi- 
dimensional adjectives in general—positive as well as comparative forms. Anna is 


34 Since we have to select the granule first, it is a kind of ‘local’ equivalence class.

[Figure: two panels (a, b) with axes length and width, each showing a classifier p*, the right closure of the granule around μ(Berta’s table), and μ(Anna’s table)]

Fig. 9 a Quasi-exactly interpretation, two dimensions, $p^* \cap cl_{\rightarrow}([r]_{\sim_F}) = \emptyset$. b Quasi-exactly
interpretation, two dimensions, $p^* \cap cl_{\rightarrow}([r]_{\sim_F}) \neq \emptyset$


healthy is true if Anna’s health is similar to a (contextually determined) healthy 
prototype. Anna is more healthy than Berta is true if Anna’s health is more similar 
to the prototype than Berta’s health. 

The core of the formalism consists of sets of representations equipped with a preorder
structure (transitive, reflexive, but possibly not antisymmetric). This preorder implements
a concept of granularity and granularity change. It will be used to construct
a predicate more_similar based on a similarity relation defined by indiscernibility. 
For two representations F and F’ we can ask whether one is more fine-grained than 
the other, that is, whether there are entities that can be distinguished in one represen- 
tation but not in the other. Distinguishability is the opposite of indiscernibility and 
depends on the attribute spaces and the available classifiers. Therefore, these param- 
eters determine the granularity of representations. We will introduce a reflexive and 
transitive relation on representations (a preorder), which relates granularity levels. 


Definition 8 Granularity of representations 


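Given two representations
$\mathcal{F} = \langle F, \mu, \_^*, \mathcal{D} \rangle$ with $\mathcal{D} = \langle D, \_^+, \_^-, P \rangle$
$\mathcal{F}' = \langle F', \mu', \_^{*\prime}, \mathcal{D}' \rangle$ with $\mathcal{D}' = \langle D', \_^{+\prime}, \_^{-\prime}, P' \rangle$
we define:
$\mathcal{F}'$ is at least as coarse as $\mathcal{F}$, $\mathcal{F}' \geq \mathcal{F}$, iff there is a function $f: F \rightarrow F'$ such that


(a) the following diagram commutes:

[Diagram: $\mu: D \rightarrow F$, $\mu': D' \rightarrow F'$ and $f: F \rightarrow F'$, with $D$ and $D'$ related by inclusion]


(b) $\forall x, y \in F: x \sim_{\mathcal{F}} y \rightarrow f(x) \sim_{\mathcal{F}'} f(y)$.


This definition states that what is indiscernible in the finer representation cannot be
discriminated in the coarser representation. The strict version, F′ is coarser than F,
F′ > F, can be defined by the non-strict one:

F′ > F iff F′ ≥ F and not F ≥ F′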
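The two conditions of Definition 8 can be checked mechanically on small finite examples. The Python sketch below assumes that indiscernibility means agreement on all classifiers of a representation, and it models attribute spaces as integer ranges; these are simplifying assumptions, and all names (`diagram_commutes`, `preserves_indiscernibility`, the example classifiers and measure functions) are hypothetical.

```python
from typing import Callable, Dict, Iterable, Sequence

Classifier = Callable[[int], bool]


def indiscernible(x: int, y: int, classifiers: Sequence[Classifier]) -> bool:
    """Assumed reading of x ~ y: no classifier separates x from y."""
    return all(q(x) == q(y) for q in classifiers)


def diagram_commutes(
    individuals: Iterable[str],
    mu: Dict[str, int],
    mu_coarse: Dict[str, int],
    f: Callable[[int], int],
) -> bool:
    """Condition (a): measuring coarsely equals measuring finely and then applying f."""
    return all(mu_coarse[d] == f(mu[d]) for d in individuals)


def preserves_indiscernibility(
    space: Iterable[int],
    fine: Sequence[Classifier],
    coarse: Sequence[Classifier],
    f: Callable[[int], int],
) -> bool:
    """Condition (b): whatever is indiscernible in the fine space stays so after f."""
    points = list(space)
    return all(
        indiscernible(f(x), f(y), coarse)
        for x in points
        for y in points
        if indiscernible(x, y, fine)
    )


# Hypothetical example: heights in cm (fine) mapped to decimetres (coarse).
SPACE = range(170, 191)
FINE = [lambda h: h >= 175, lambda h: h >= 180, lambda h: h >= 185]
COARSE = [lambda d: d >= 18]
TOO_SHARP = [lambda d: d >= 19]  # separates 185 cm from 190 cm after mapping
to_dm = lambda h: h // 10

MU = {"Anna": 181, "Berta": 186}
MU_COARSE = {"Anna": 18, "Berta": 18}

print(diagram_commutes(MU, MU, MU_COARSE, to_dm))
print(preserves_indiscernibility(SPACE, FINE, COARSE, to_dm))
print(preserves_indiscernibility(SPACE, FINE, TOO_SHARP, to_dm))
```

The third check fails because the alternative coarse classifier draws a distinction that the fine classifier system does not make, so that representation would not count as at least as coarse under this reading.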


What we need now is a specification of a relevant set of representations H. The 
coarser relation then turns H into a preorder. We call such a structure a hierarchy 
of representations. What is missing to get a partial order from a preorder is the antisymmetry
axiom: from F ≥ F′ and F′ ≥ F we cannot conclude that F = F′. We may
have different possibilities to get the same structure of granules. These hierarchies 
are related to the concept of context (van Rooij 2011). 


Definition 9 Hierarchy of representations 


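A hierarchy $\mathcal{H}$ is a set of representations such that for any two elements
$\mathcal{F}_{1/2} = \langle \langle F_{1/2}, cl_{1/2} \rangle, \mu_{1/2}, \_^{*_{1/2}}, \langle D_{1/2}, \_^{+_{1/2}}, \_^{-_{1/2}}, P_{1/2} \rangle \rangle \in \mathcal{H}$


we postulate the following constraints³⁵:


• consistency: $\forall p \in P_1 \cap P_2: (p^{+_1} \times p^{-_1}) \cap (p^{-_2} \times p^{+_2}) = \emptyset$
Elements of $p^+$ and $p^-$ cannot change roles in different domains.

• discriminative power: $\forall p \in P_1 \cap P_2: (p^{+_1} \times p^{-_1}) \cap (D_2 \times D_2) \neq \emptyset \rightarrow p^{+_2} \times p^{-_2} \neq \emptyset$
If a domain contains a discriminating pair of another domain for a shared predicate
identifier, it must itself contain a discriminating pair.³⁶

• connectedness:
$\exists \mathcal{F} = \langle \langle F, cl \rangle, \mu, \_^*, \langle D, \_^+, \_^-, P \rangle \rangle \in \mathcal{H}: D_1 \subseteq D \wedge D_2 \subseteq D \wedge P_1 \subseteq P \wedge P_2 \subseteq P$
and there are continuous closure-preserving functions $f_{1/2}: F \rightarrow F_{1/2}$ with $\mu_{1/2} = f_{1/2} \circ \mu$.
For any two domains there is an enclosing domain.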


These constraints can be visualized by the Venn diagrams in Figs. 10–13:


35 (a) and (b) are adaptations of the context constraints in (van Rooij 2011: Definition 1).

36 If we have big and small elephants and view them as animals, then there should be big and small
animals, too. Either there are small animals like mice or, if all animals have the size of elephants,
then small elephants must be small animals, too. See the Venn diagrams in Figs. 11 and 12.

[Venn diagram: $x \in big^{+}_{animals} \cap big^{-}_{elephants}$, $y \in big^{-}_{animals} \cap big^{+}_{elephants}$]


Fig. 10 Consistency


[Venn diagram with regions labeled “discriminative for big elephants” and “discriminative for big animals”]


Fig. 11 Discriminative power, 1


[Venn diagram with regions labeled “discriminative for big elephants” and “discriminative for big animals”]


Fig. 12 Discriminative power, 2


(a) the consistency constraint rules out cases like this: if y is a big elephant and x 
is a small (not big) one, then x cannot be a big animal if y is a small (not big) 
one (Fig. 10). 

[Venn diagram with the label “big animals”]


Fig. 13 Connectedness 


(b) discriminative power: If there are big and small elephants there must be big 
and small animals, too, because animals are different in size: We have big and 
small elephants which are animals. 


b1. If we collect elephants and mice in one animal domain, then a mouse
(big or not) is a negative example for big animals. Thus we have a 
discriminating pair for big animals (Fig. 11). 

b2. If we collect only big animals in one animal domain, say elephants, 
hippos, and rhinoceroses, then any discriminating pair for these species 
is also discriminative for big animals (Fig. 12). 


(c) connectedness: For any two domains there must be a super domain containing 
both (upward directed) (Fig. 13). 


In the remainder of this section we assume that there is a contextually given hierarchy 
of representations H. Our approach is non-constructive in the following respect: We do
not construct representations and hierarchies, but instead have systems of constraints 
which hierarchies must obey. The instantiations must be given by, e.g., the situation 
of the utterance. 

We will now demonstrate how to define a general relation more_sim(a, b, c, d, F) 
based on our similarity relation sim and the preorder on representations. The relation 
more_sim(a, b, c, d, F) is intended to be true if a is more similar to b than c is to d 
with respect to a representation F. 


Definition 10 More similar 


Given a hierarchy H, a similarity relation³⁷ sim, and a representation F ∈ H, we
define
more_sim(a, b, c, d, F) iff


(a) $\exists F' \in H: F' \geq F \wedge sim(a, b, F') \wedge \neg sim(c, d, F')$
(b) $\forall F' \in H: F' \geq F \rightarrow (sim(c, d, F') \rightarrow sim(a, b, F'))$


37 We discussed different similarity relations (see Sect. 4). In this definition, we can use any of these.
The widely used version more_sim(a, b, c, F) in the sense that a is more similar to
b than c is similar to b can be defined straightforwardly by: 
more_sim(a, b, c, F) = more_sim(a, b, c, b, F) 


If a is more similar to b than c to d in a given representation F it must be possible 
to discriminate between c and d. Otherwise, because c and d are maximally similar,
a and b cannot be more similar than c and d. If we can discriminate between c and 
d in F' then we can discriminate between c and d in every finer representation but 
maybe not in every coarser one. If we can find a representation F’ (maybe coarser 
than F), such that we can discriminate between c and d but not between a and b 
(Definition 10a), we are almost done. It remains to exclude contradictions, that is, 
representations in which we can discriminate between a and b but not between c and 
d (this is excluded by Definition 10b). 
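The interplay of the existential condition (a) and the universal condition (b) can be sketched in Python for a toy hierarchy. The sketch assumes that each representation simply assigns a label to each object and that sim means "same label"; the hierarchy, the labels and the explicit list of "at least as coarse" pairs are hypothetical illustrations, not the constructions of Figs. 14 and 15.

```python
from typing import Callable, Iterable, Set, Tuple

Sim = Callable[[str, str, str], bool]


def more_sim(
    a: str, b: str, c: str, d: str, F: str,
    hierarchy: Iterable[str],
    coarser_or_equal: Set[Tuple[str, str]],
    sim: Sim,
) -> bool:
    """a is more similar to b than c is to d with respect to F.

    A pair (G, F) in coarser_or_equal means that G is at least as coarse as F.
    """
    above = [G for G in hierarchy if (G, F) in coarser_or_equal]
    # (a) some representation above F separates c from d but not a from b
    witness = any(sim(a, b, G) and not sim(c, d, G) for G in above)
    # (b) no representation above F separates a from b while identifying c and d
    consistent = all(sim(a, b, G) or not sim(c, d, G) for G in above)
    return witness and consistent


def more_sim3(a: str, b: str, c: str, F: str, hierarchy, coarser_or_equal, sim: Sim) -> bool:
    """Three-place version: a is more similar to b than c is to b."""
    return more_sim(a, b, c, b, F, hierarchy, coarser_or_equal, sim)


# Hypothetical hierarchy with a fine and a coarse color representation.
LABELS = {
    "fine": {"x": "yellow", "y": "light-blue", "z": "blue"},
    "coarse": {"x": "yellow", "y": "bluish", "z": "bluish"},
}
HIERARCHY = list(LABELS)
COARSER_OR_EQUAL = {("fine", "fine"), ("coarse", "coarse"), ("coarse", "fine")}


def same_label(a: str, b: str, F: str) -> bool:
    return LABELS[F][a] == LABELS[F][b]


print(more_sim3("z", "y", "x", "fine", HIERARCHY, COARSER_OR_EQUAL, same_label))
print(more_sim3("x", "y", "z", "fine", HIERARCHY, COARSER_OR_EQUAL, same_label))
```

In the fine representation all three objects are pairwise discernible; the statement that z is more similar to y than x is holds there only because the coarser representation supplies the witness required by (a).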

The diagrams in Figs. 14 and 15 show example hierarchies of representations
talking about color and size of objects (each circle stands for a representation). We 
start with Fig. 14. 

Representations which are higher in the hierarchy are coarser than lower ones. 
On the left branch we introduce a dimension color and a classifier system based on 
{yellow*, light-blue*, blue* } which can classify colors by convex subsets of a (three- 
dimensional) color space. On the right branch, we introduce a dimension size with 
a corresponding classifier system {small*, big*, huge*}. The bottom representation 
integrates the left branch and the right branch (Definition 9 connectedness). Again, 
the size dimension need not be a simple proportional scale. It can itself be a
three-dimensional vector space with sub-dimensions length, width, and height. 

According to Definition 10a, the more_sim relation will be inherited from
top to bottom along the coarser relation. In the circles, we see the extensions of 
the corresponding P* elements. Next to the circles we see the statements about 
more_sim which are true in these representations. These statements depend not only 
on the representation they are attached to, but on the whole upper structure (the filter) 
of the representation. If we look at the circle at the bottom, $F_{c+s}$, we see that we inherit
two statements, both from the left branch:

more_sim(y, z, x, $F_{c+s}$) and more_sim(z, y, x, $F_{c+s}$).


From the right branch, we inherit nothing because the classifier system is too weak. 
Representations may inherit inconsistent information from different paths which rule 
out some of the statements (by Definition 10b). We can see this when we add more 
powerful classifiers to the right branch, see Fig. 15. 

The two heavily bordered circles (F, and FR) are alternatives which have different 
effects on the more fine-grained representations (below). The representation Fs 
(circle below F, and Fp) inherits more_sim statements though some are ruled out 
by the consistency constraint (Definition 10b). In the bottom circle F,,, all state- 
ments are ruled out by the consistency constraints if both F, and Fp are present 
in H. In Fes, more_sim(z, y, x, Fc+s3) would be true (z is more similar to y than x 
is) according to color because of F.” and Definition 10a. In this case, we cannot 
discriminate between z and y, but we can discriminate between x and y. According 
to the existential quantifier in Definition 10a, this is propagated downwards. On the 
Fig. 15 Hierarchy of representations, Example 2
other side, in Fr we cannot discriminate between x and y. According to Definition
10b and the universal quantification we should not be able to discriminate between
z and y in this representation, but we are. Therefore, we get a contradiction. 

Since in a natural language utterance the hierarchy of representations is not explic- 
itly expressed, we can interpret the meaning of an utterance like A is more similar to 
B than C only as a constraint on the relevant hierarchy of representations.


6 Conclusion 


We presented a framework introducing a non-metric and qualitative concept of 
similarity suitable for the interpretation of similarity in natural language. 

The basic idea is to “measure” properties of individuals with the help of multi- 
dimensional attribute spaces representing relevant features of comparison (thus 
generalizing the idea of degree semantics). In our framework, attribute spaces are 
complemented by classifiers which are predicates on points in attribute spaces 
approximating domain predicates; this is what we define as a representation. Indi- 
viduals count as similar with respect to a particular representation if their values are 
indistinguishable. 

In our framework, the granularity of the similarity relation may vary due to 
different dimensions of comparison and classifier systems. This leads to sets of 
representations forming hierarchies of different granularity levels, where the order 
on representations facilitates a Kleinian style notion of more similar. 

This system provides a powerful and flexible tool to capture the meaning of natural 
language similarity expressions and account for the role of similarity in ad-hoc kind 
formation as well as equative comparison. Future work will explore its capacity in, 
e.g., multi-dimensional comparison of adjectival, nominal and verbal properties. The 
general idea of our approach is to reconstruct comparison in natural language in a 
qualitative way, with the help of different levels of granularity imposed by constraints 
on systems of classifiers. 
1 Introduction


Numerical concepts are an integral part of everyday conversation and communica- 
tion. While mathematicians assign a precise interpretation to a natural number, e.g., 
5 being exactly 5, the use and understanding of numerical expressions in natural 
language have a high variability. Broadly speaking, scientists use numbers more 
precisely when they discuss their research results (for example, 0.051 and 0.049 
make a big difference in terms of statistical significance) than street vendors at a flea
market in Berlin (e.g., 51 or 49 cents for a broken antique glass are probably equally
good results). In addition to broad context, narrower context such as questions under 
discussion (QUDs, Roberts 1996) or decision problems can influence the interpreta- 
tion of numerical expressions as well: If a waiter asks “How many beers would you 
like to order?”, we mean exactly 10 when we say 10, no more, no less. If a student
is eligible for taking the exam with 2 assigned tasks, s/he is eligible with 2 assigned 
tasks—2 means at least 2. In contrast, if a student can pass the exam with 10 mistakes, 
10 means at most 10. Furthermore, the interpretation of numerical expressions can 
also be subject to individual and developmental factors (e.g., Musolino 2004). In 
this paper, we will focus on the interpretive variability of numerical expressions in
narrow linguistic contexts, namely, the nature of a number itself, and its co-occurring 
expressions. 

Among others, the interpretation of numerical expressions depends on the 
perceived “roundness”: Round numbers (e.g., 50) can have either an imprecise or a
precise interpretation, whereas non-round numbers (e.g., 47) tend to have a precise
interpretation. For example, Krifka (2002, 2007, 2009) proposes an “RNRI” (round
numbers round interpretation) principle: “Round number words tend to have a round 
interpretation in measuring contexts”. Supporting evidence comes from the highly 
frequent use of round numbers in, among others, newspapers or street/distance signs, 
even though statistically speaking, it is very unlikely that the results of measurements 
are round more frequently than they are not (given sensitive instruments). In (1), taken 
from the Leipzig Wortschatz Corpus (Goldhahn et al. 2012), it is intuitive to assume 
that all the numerical expressions have an imprecise interpretation. 


(1) a. Forty thousand people in the state remained without water, and 26,000 
people were without electricity, she said, warning once again that people 
should stay inside. 

b. Gibraltar Airport - Located just 500 meters from the city center, Gibraltar’s 
airport landing strip shares space with one of the island’s main roads. 


Another piece of evidence is shown in the contrast between (2a) and (2b). Whereas 
(2a) is acceptable to characterize situations where John made 49 cupcakes, the use of 
(2b) is degraded in the same contexts. This shows that in contrast to round numbers, 
non-round numbers have a precise interpretation. 


(2) a. John made 50 cupcakes. 
b. John made 48 cupcakes. 


A second factor contributing to the varying interpretation of numerical expressions is 
the type of approximator used in the expression. Precise approximators (e.g., exactly) 
impose a precise interpretation, whereas imprecise approximators (e.g., roughly, 
approximately, about) do the opposite, see (3a). However, due to the tendency of 
non-round numbers receiving a precise interpretation, it has been pointed out in 
Sauerland and Stateva (2011)¹ that it is odd to use them together with imprecise
approximators, as can be seen in the contrast in (3b). 


(3) a. John made exactly/roughly 50 cupcakes. 
b. John made exactly/?roughly 48 cupcakes. 


While the first and the second factors have received extensive treatment in the 
literature (a.o., Lakoff 1973; Rips et al. 2007; Krifka 2007, 2009; Sauerland and 
Stateva 2011; Kennedy 2013; Solt 2014), there is a third factor affecting the 


1 What should also be noted with respect to approximators is Geurts’ (2006) sharp observation that
precise approximators can only modify expressions that already have an exact meaning—while 
exactly five sneezes or precisely half the cake are perfectly acceptable expressions, exactly tall or 
exactly some cookies are not. 
interpretation of numerical expressions which to our knowledge has largely been
unexplored, namely, the unit of measurement. Consider (4): the combination of an 
imprecise approximator and a non-round number is not odd, which stands in contrast 
to “roughly 48 cupcakes” in (3b). The difference between the targeted expressions 
is that the unit “cupcake” in (3b) is discrete and the one in (4) “meter” is continuous. 


(4) The tower is exactly/roughly 48 meters high. 


The current paper examines these three factors in detail, as well as their ways of inter- 
action. The paper is structured as follows. In Sect. 2, we provide a review of related 
works from theoretical linguistics. In Sect. 3, we report on a corpus-linguistic study 
with the following main findings: imprecise approximators occur more frequently 
with round numbers (e.g., roughly 50) than with non-round numbers (e.g., roughly 
48). Furthermore, discrete units occur significantly less frequently than continuous 
units in the latter combination (e.g., roughly 48 people vs. roughly 48 meters), 
which indicates the imprecise nature of the continuous unit. In Sect. 4, we report 
a rating study testing the naturalness of imprecise approximators in combination 
with different kinds of numbers and different kinds of units. Our results show effects
of both Number and Unit but no interaction between them. Section 5 provides a
general discussion and concludes the paper. 

Generally speaking, this chapter provides insights into the representation and 
application of numerical concepts. We focus our research on the usage and interpre- 
tation of these concepts in natural language texts, using the results of both a corpus 
study and a rating study. In our literature review, we summarize different formal 
models for representing the meaning of numerical expressions, which can be seen as 
(partial) representations of numerical concepts. In our two studies, we then seek to 
confirm the qualitative predictions made by these models about the practical usage 
of such numerical expressions. Our work can be related to the contribution by Gust 
and Umbach (Chap. 4) who also consider the granularity of interpretation for natural 
language phrases. While their work targets similarity expressions of varying kinds, 
we put our focus on expressions that involve concrete numbers. Our experimental 
rating study can be related to the procedure by Scerrati et al. (Chap. 6) who record 
binary responses on individual words, while we make use of Likert scale ratings 
on complete sentences. Finally, the focus on the interpretation of natural language 
phrases is also investigated by Vernillo (Chap. 8), who uses a theoretical analysis of 
individual verbs based on image schemata, while we perform a corpus study and a 
rating study on more complex phrases. 


2 Theoretical Background 


In this section, we provide a detailed discussion of the three linguistic factors influ- 
encing the overall interpretation of numerical expressions, based on the literature. As 
our concern is on their semantics and pragmatics, we assume a simplified “NumP” 
(i.e., number phrase) structure for them consisting of a NumP-modifier (e.g., exactly),
a Num head (e.g., fifty), and an NP complement (e.g., people), but are open to 
alternative syntactic structures. 


2.1 Number: Round Versus Non-round 


The discussion of round in contrast to non-round numbers is heavily intertwined with 
the topic of the granularity of scales in which we think. Thinking on a coarse-grained 
level can be seen as thinking in gross bins. A fine-grained, possibly continuous (i.e., 
maximally fine-grained) scale is simplified by turning it into a discrete scale with 
fewer values, therefore coarse-grained thinking means simplified thinking. While 
these few values are salient and meaningful in the way that we can quickly process 
and interpret them in a given context, using coarse-grained scales potentially results 
in less precise reports in measuring contexts. 

If we look at scales of different granularity levels such as (5), we will find that 
round numbers appear both on fine-grained and on coarse-grained scales. This is 
not the case for non-round numbers—the more coarse-grained a scale becomes, the 
fewer non-round numbers it contains. 


(5) Scales progressing in steps of 10, 5 and 1 respectively

a. 0..10..20
b. 0..5..10..15..20
c. 0..1..2..3..4..5..6..7..8..9..10..11..12..13..14..15..16..17..18..19..20


Only values on a coarse-grained scale however can represent a whole range of other 
values; thus, since the values appearing on coarse-grained scales usually are round 
numbers, round numbers logically allow for an imprecise interpretation. In contrast, 
non-round numbers do not appear on coarse-grained scales and therefore only lend 
themselves to a precise interpretation. Thus, one would rather interpret expressions 
imprecisely that make available an imprecise interpretation than expressions that do 
not allow such an interpretation. This is why we tend to interpret round numbers 
imprecisely and non-round numbers precisely. 

But what does ‘round’ really mean? The concept of roundness depends on the 
context. Solt (2014) speaks of a gradient nature of roundness, meaning that there is 
a ‘more’ and a ‘less’ to roundness: the hierarchical ordering of scales with respect 
to granularity yields this gradient. For example, 5 can be considered round since it 
also appears on the more coarse-grained scale (5b), but less round than 10, which 
appears on an even more coarse-grained scale (5a). In some cases, a number might 
be considered round if it only has—or is rounded to—two decimal places. In other 
cases, non-round numbers can take on the same function as round numbers, such 
as 12 or 24 h in a coarse-grained time scale (see more examples in Krifka 2007). 
In other words, the availability of an imprecise interpretation of a number does 
not necessarily depend on it being round; it rather depends on its coarse-grainedness 
within a system of representation. As our numerical reasoning most commonly makes 
use of the decimal system however, which is a base-ten numeral system, round
numbers like 10, 100, etc. and simple fractions of them most frequently coincide 
with coarse-grainedness and are thus more likely to be interpreted imprecisely. 
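The gradient notion of roundness can be made concrete with a small Python sketch: a number counts as rounder the coarser the scale on which it still appears. The function name `roundness` and the particular step sizes are illustrative assumptions; the default steps mirror the scales in (5), and the second call uses a made-up hour scale.

```python
from typing import Sequence


def roundness(n: int, steps: Sequence[int] = (1, 5, 10)) -> int:
    """Return the largest step size whose scale contains n (0 if none does)."""
    return max((s for s in steps if n % s == 0), default=0)


# Decimal-style scales as in (5): 10 is rounder than 5, which is rounder than 7.
print([roundness(n) for n in (7, 5, 10, 48, 50)])

# A hypothetical time scale in hours: 12 and 24 appear on coarse-grained scales
# although neither is a multiple of ten.
print([roundness(n, steps=(1, 12, 24)) for n in (12, 24, 20)])
```

The second call illustrates the point made above: what matters is coarse-grainedness within a system of representation, and base-ten roundness is only the most common special case.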

Krifka (2007, 2009) assumes two general pragmatic principles from which he 
derives (and which shall explain) the RNRI (“Round Numbers, Round Interpreta- 
tions”) phenomenon: (I) weak preference for simple expressions, (II) strict preference 
for truthful interpretations. The first principle explains why round numbers are used 
more imprecisely than non-round numbers. The second principle explains why round 
numbers are interpreted more imprecisely than precisely. 

In more detail, Krifka assumes a conditional preference for simple expressions, 
which explains the approximate usage of round numbers in contexts that do not 
require high precision. If a speaker has the choice between uttering forty-eight or 
fifty, he will most likely choose the simpler expression, for reasons of communication 
efficiency. The preference is conditional in the sense that it can only come into effect 
if the difference between the two numbers is not relevant in the context (e.g., with 
specific QUDs or decision problems). Under a precise interpretation, however, the 
preference cannot come into effect; the speaker does not have the choice between one 
expression or the other. Krifka models the virtual equivalence between two measure 
expressions in low-precision contexts in the following way: Under an approximate 
interpretation, numbers represent ranges which can be characterized by a mean, i.e., 
the number which the interval is centered around, and a standard deviation, defining 
the borders of the interval, which also indicates the level of imprecision.² Naturally,
ranges of two numbers can overlap if the values are close to each other. Two numbers 
are said to be indistinguishable from each other under an approximate interpretation if 
the ranges they represent overlap in such a way that their means are within their stan- 
dard deviations. Under an approximate interpretation, forty-eight could for instance 
represent the range [46, 47, 48, 49, 50] (having the mean 48 and the standard devia- 
tion 2), whereas fifty would represent [48, 49, 50, 51, 52] in that case. Their means 
are within their standard deviations, so they are considered indistinguishable under 
this approximate interpretation. However, fifty has the advantage over forty-eight 
in that it has a simpler form (and is also otherwise more cognitively salient). The 
speaker thus chooses to utter fifty instead of forty-eight in a context where approxi- 
mate interpretations are licensed. This also explains why non-round numbers are not 
interpreted in an approximate way: Once there are several indistinguishable alterna- 
tives one could make use of when reporting a measurement, the alternative with the 
simplest form is chosen, which excludes non-round numbers from the race. 
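
The interval reasoning above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not Krifka’s own formalization: ranges are taken to be integer intervals of half-width `sd`, and the `complexity` ranking (multiples of ten before multiples of five before all other numbers) is a hypothetical stand-in for simplicity of expression.

```python
def approximate_range(n, sd):
    """Integers within `sd` of `n`, e.g. 48 with sd=2 covers 46..50."""
    return list(range(n - sd, n + sd + 1))


def indistinguishable(a, b, sd):
    """True if each number lies within the other's range (|a - b| <= sd)."""
    return abs(a - b) <= sd


def complexity(n):
    """Hypothetical simplicity ranking (an assumption): lower means simpler."""
    if n % 10 == 0:
        return 0
    if n % 5 == 0:
        return 1
    return 2


def preferred_expression(value, sd):
    """Pick the simplest numeral that is indistinguishable from `value`.

    Ties in complexity are broken by closeness to the measured value.
    """
    candidates = [m for m in approximate_range(value, sd)
                  if indistinguishable(value, m, sd)]
    return min(candidates, key=lambda m: (complexity(m), abs(m - value)))


range_48 = approximate_range(48, 2)
range_50 = approximate_range(50, 2)
choice = preferred_expression(48, 2)
```

With a half-width of 2, the ranges of forty-eight and fifty overlap, and the sketch selects 50 as the expression to utter for a measured value of 48.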

Under a precise interpretation, numbers denote only themselves: forty-eight 
denotes 48 and fifty 50. The possibility of choosing between alternatives does not 
arise because their denotations are clearly different. 


(6) a. John made 50 cupcakes. 
b. John made 48 cupcakes. 


² More specifically, Krifka models an imprecise number as a normal distribution which is centered
around the number. To simplify things, he confines his discussion to a representation in terms of
intervals.

Assuming a context which licenses an approximate interpretation, Krifka’s model
explains the acceptability of (6a) since fifty represents the range [48, 49, 50, 51, 
52] which includes 48 and 51. If the context requires a precise interpretation, fifty 
represents only 50; the usage of this numeral thus would make (6a) false in situations 
where John made 48 or 51 cupcakes. Similarly, forty-eight in (6b) could represent the 
range [46, 47, 48, 49, 50] under an approximate interpretation. However, the speaker 
would have uttered fifty in such a situation, since under an approximate interpretation 
fifty is indistinguishable from forty-eight, and it is simpler. Thus, forty-eight cannot 
be interpreted imprecisely here—instead, it must denote solely its own value. 

The second principle ought to explain an assumption specific to Krifka’s theory. 
By way of principle (II), the preference for truthful interpretations, Krifka explains 
why an approximate interpretation of an encountered round number is more sensible 
than a precise one. Krifka holds the assumption that we prefer an imprecise interpre- 
tation of round numbers and therefore usually interpret round numbers imprecisely 
(an assumption challenged by Ferson et al. 2015). He argues that an imprecise inter- 
pretation maximizes the probability of truth of the statement: It is more likely that the 
value of a reported measurement is in the range of the interval around the reported 
number (which amounts to an approximate interpretation) than it is likely that the 
value is the number itself (which amounts to a precise interpretation). And since 
Krifka also assumes that we follow principle (II), he concludes that the approximate 
interpretation is the preferred one. On the other hand, an addressee can conclude 
from an utterance containing the more complex expression that a precise interpreta- 
tion must have been intended since this is the only context where complex expressions 
are used—whenever possible, i.e., under an approximate interpretation, the simpler 
expression (which coincides with round numbers in this case) is chosen over the 
more complex alternative. 

So far, Krifka’s argumentation has had little to do with a theory of granularity. One
might ask, however, why it is generally the case that round numbers are simpler
than non-round numbers. It turns out that the superficial simplicity argument can be 
reformulated in terms of the scale granularity framework. Krifka points out that it 
is not just the simplicity of the form of some expression that contributes to whether 
it is interpreted precisely or imprecisely. Instead, what matters even more is the 
expression’s simplicity in terms of representation. This is where scale granularity 
becomes important. The simplicity of representation is marked by whether a value 
is cognitively salient on the scale of reference. 

A numerical representation might be perceived as simple (more easily graspable) 
if it appears on coarse-grained scales of the unit. It becomes clear that the term 
simple here refers to how easily we can process the conveyed bit of information, as 
in the aforementioned example of time scales {0, 12, 24, 36, 48, ...}. Notice that 
twenty-four is neither simpler than twenty-three in terms of form nor round. It is 
because of the expression’s simplicity of representation and persistence throughout 
scales of different granularity levels that a speaker might choose twenty-four over 
twenty-three under an approximate interpretation. 
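
The idea that a value "persists" across scales of different granularity can be sketched as follows. The step sizes in `HOUR_SCALE_STEPS` are hypothetical granularity levels chosen for illustration; only the 12-hour scale is taken from the example above.

```python
# Hypothetical granularity levels for an hour scale, finest to coarsest.
HOUR_SCALE_STEPS = [1, 6, 12, 24]


def scales_containing(value, steps=HOUR_SCALE_STEPS):
    """Return the step sizes of all scales on which the integer `value` appears."""
    return [step for step in steps if value % step == 0]


def coarsest_scale(value, steps=HOUR_SCALE_STEPS):
    """Return the coarsest step size whose scale still contains `value`."""
    return max(scales_containing(value, steps))


levels_24 = scales_containing(24)
levels_23 = scales_containing(23)
```

Under these assumed levels, 24 appears on every scale, whereas 23 appears only on the finest one, which mirrors the contrast between twenty-four and twenty-three.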

We can conclude that a simple representation promotes an imprecise interpretation 
because it allows one to reason on a coarse-grained level of scales. Krifka additionally 
argues that in many cases, simplicity of expression and simplicity of representation
coincide—not coincidentally, but because the frequency of use dictates such a devel- 
opment. Simplification of expressions is a result of an increase of frequency due to 
their additional approximate use: “salient representations tend to be shorter, and tend 
to be shortened in language change” (Krifka 2007). 

Generally speaking, a characteristic of a round number is that it is simple: self- 
contained (no infinite decimal places) and conceptually graspable and decodable; it 
is a number that exists in a simple system of representation (for instance a system 
of multiples of tens)—the system depends on the context of use. In this paper, 
we will restrict our empirical analyses to a limited set of (conventionalized) round 
numbers (e.g., 10-roundness and 5-roundness, which do not need contextual support) 
in contrast to nearby non-round numbers.


2.2 Approximator: Approximate Versus Exact 


While we have discussed that (im)precise interpretations of numerals can arise from 
implicit assumptions about the numbers themselves, there is also an overt means 
for marking the intended level of precision. Approximators like exactly, precisely, 
around, and approximately are classified as hedges (Lakoff 1973): Expressions which 
modify the certainty, force, or precision implied by statements. Also belonging to this 
class are expressions like maybe or I assume (called shields), which can modify whole 
sentences. Approximators are a means of explicitly marking the degree of precision 
with which a measure expression is to be interpreted, but on a different level, the use 
of approximators also reveals something about the certainty with which a speaker 
utters something. The latter is evident if we consider uses of the approximators as 
speech-act adverbs, e.g., Roughly speaking, I have 50 students in my class. We leave 
it for future studies what differences such sentences have compared to I have roughly 
50 students in my class. 

When a speaker intends to indicate high certainty about the accuracy of
the uttered numeral, they are likely to use precise approximators. When doing so, the
speaker simultaneously decreases the risk of conveying false information, which is
higher with an unmodified alternative. In other words, using approximators increases
the probability that the utterance is true. Thus, using imprecise approximators can also
signal the speaker’s uncertainty in addition to imprecision in measuring, which is
emphasized in Ferson et al.’s (2015) work.

While Krifka’s (2007) work is not concerned with the effect of approximators 
on numerical expressions, Solt (2014) extends the granularity-based framework 
to provide an account of these modifying expressions. She also introduces a new 
formalism for determining truth or falsity of sentences with numerical expressions 
that includes a contextually determined granularity level. In her analysis, the overt 
use of approximators in combination with numerals is modeled as a mapping from 
point-denoting expressions (the bare numerals) to intervals around these expressions. 
Explicitly modified numerals thus denote a scalar segment. Solt formally defines the
semantics of approximators as in (7): 


(7) $[\![\text{APPROXIMATOR } n]\!]^{\mathit{gran}} = (n - \mathit{gran}/2,\ n + \mathit{gran}/2)$


For imprecise approximators, $\mathit{gran}$ is the coarsest possible unit for a granularity level
one could choose given the context. For precise approximators, $\mathit{gran}$ is the finest
possible choice of a granularity level given the context. Thus, $[\![\text{about } 50]\!]^{\mathit{gran}(=10)}$
would denote the interval (45, 55) in the appropriate context. It becomes clear that the
denotation of a modified measure expression differs from the original numeral in that
it (roughly) denotes the range of values halfway between the neighboring values on
the coarse-grained level. In formal semantic terms, this complex expression however
still is of type “degree” despite not denoting a point.


(8) $[\![\text{exactly fifty}]\!]^{\mathit{gran}(=0.01)} = (50 - 0.01/2,\ 50 + 0.01/2)$


Notably, Solt’s analysis of approximators, as shown in (8), yields as a result that
precise approximators can make an expression more imprecise after being combined
with the approximator. Although the granularity level is very fine-grained (with $\mathit{gran}$
being 0.01), the resulting complex expression denotes a more coarse-grained degree
than the bare, unmodified numeral, namely, (49.995, 50.005) instead of 50. On the
one hand, the analysis of the complex expression is not counterintuitive since in 
some contexts the usage of a precise approximator does not signal maximal but only 
increased precision. However, what seems unintuitive is that the bare numeral in 
contrast can never denote anything more imprecise than the maximally precise point 
it always denotes. The denotation of the numeral modified by a precise approximator 
is more imprecise than the denotation of the unmodified numeral. This conflicts 
with the empirical findings of Ferson et al.’s (2015) study that precise approximators 
(exactly and precisely) rather reduce a previously assumed range of imprecision 
associated with a numeral instead of making numerals more imprecise. 
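
The mapping from a bare numeral to an interval can be sketched in a few lines of Python. This is a minimal illustration of the formula in (7), not Solt’s implementation; the function name `approximator_interval` is hypothetical, and the granularity values 10 and 0.01 are the ones used in the examples above.

```python
def approximator_interval(n, gran):
    """Open interval (n - gran/2, n + gran/2) denoted by a modified numeral."""
    half = gran / 2
    return (n - half, n + half)


# Imprecise approximator with a coarse granularity level of 10.
about_50 = approximator_interval(50, 10)

# Precise approximator with a fine granularity level of 0.01.
exactly_50 = approximator_interval(50, 0.01)

# Width of the interval; a bare numeral denoting a point has width 0.
exactly_50_width = exactly_50[1] - exactly_50[0]
```

The sketch makes the issue discussed above visible: even at the finest granularity level, the modified numeral denotes an interval of non-zero width, whereas the bare numeral denotes a single point.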

Since Solt’s theory does not assume numerals to denote ranges in the first place, 
there is no way she can model how an approximator can reduce the interval of 
imprecision that might be associated with a numeral. Thus, this analysis cannot 
explicitly model situations in which the context favors a default imprecise reading of 
a numeral while the approximator is used to override this reading. This is only possible
within theories that overtly model the imprecision of a numeral, such as Krifka’s, which
lets unmodified numerals denote ranges under an imprecise interpretation. These
representational issues Solt’s theory faces due to the assumption of a monosemous 
exact denotation of numerals might not pose problems in terms of truth-conditional 
analyses. However, they show that Solt’s model is also not entirely optimal as it 
seems odd to assume that exactly fifty denotes a coarse-grained degree while fifty 
does not. 

An alternative relates to Lasersohn’s (1999) theory of pragmatic halos, in which he
also proposes an analysis of approximators. Lasersohn takes precise approximators 
to be narrowing the so-called “pragmatic halos” of an expression: “Suppose, for 
illustration, that there are two points in time close enough to i that the difference 
between them and i is ignored in context, so that the halo of three o’clock is the
set {i, j, k}, ordered according the relation of closeness to i .... The real effect of 
exactly is on pragmatic halos: we want the pragmatic halo of exactly three o’clock to 
include those elements of the halo of three o’clock which are closest to i (that is, to 
the actual time of three o’clock), eliminating outlying elements.” (Lasersohn 1999: 
p. 528). In this analysis, precise approximators have no effect on the semantic level;
however, they reduce the pragmatic slack with which one may speak and thus have
an effect on whether an utterance can be used felicitously or not. This is not the case 
for imprecise approximators: They are analyzed to have the effect of expanding the 
denotation of the expression (they are combined with) into its halo. Thus, they have a 
clear truth-conditional effect in that the resulting denotation is ‘enriched’ by similar 
denotations, constituting a set. 

Combining Sects. 2.1 and 2.2, a natural question arises as to how numbers interact 
with approximators. We will not be able to work out a formal analysis here, but focus 
on the distributional constraints due to the different levels of precision encoded in 
them. 


2.3 Unit: Discrete Versus Continuous 


Seeing numbers as part of a mathematical system, we find that at the most basic 
level, number systems permit the description of quantities by means of expressions 
consisting of a numeral and a unit, where the unit specifies the scale of measurement. 
Units can, for instance, be ‘people’, ‘buildings’, ‘chairs’ for discrete quantities, but 
also ‘days’, ‘acres’, ‘metres’ for continuous quantities. 

Accordingly, a numeral can be an integer or real-valued; it furthermore can 
be expressed in words or numerical digits. Since units measure either discrete or 
continuous quantities, they can influence the numerals they appear with. Those units 
measuring discrete quantities restrict the numeral they combine with to the domain of 
integers. When measuring quantities physically, the numerical expressions used for 
description are almost always used imprecisely, especially in the case of measuring 
continuous quantities. Ferson et al. (2015) thus suggest a distinction between the 
mathematical and the ‘real world’ interpretation of a numerical quantity. Following 
this distinction means assuming that in non-mathematical contexts an unmodified 
scalar number already elicits an interpretation with an interval of imprecision; the 
expression might refer to any value within this interval. In contrast to this suggestion, 
however, Ferson et al.’s (2015) empirical study found that participants (who were 
asked to specify an interval the numbers can stand for) interpreted bare, unmodified 
numbers precisely 94% of the time, despite the fact that the expressions were
embedded in a natural language context. 

What are the effects of units on the distribution and interpretation of number words 
and expressions? We will provide partial answers to this understudied question in 
the rest of the paper. 


2.4 Summary


In summary, the use and understanding of numerical expressions are subject to influ- 
ences from both broad discourse contexts and narrow linguistic contexts. In the paper, 
we will not provide formal analyses for numerical expressions; instead, we focus on 
the empirical testing of the observations from the literature and the current work. 
In the following, we will discuss numerical expressions with two goals: First, we 
will provide empirical (i.e., corpus- and psycholinguistic) evidence for the general- 
izations related to the distinction between round and non-round numbers. Second, 
we will provide empirical evidence for the effect of unit in the interpretations of 
numerical expressions. 


3 Corpus Study 


3.1 Hypotheses 


The aim of the corpus study is, first of all, to support the initial observation made, 
namely that round numbers seem to appear more frequently in natural language 
contexts than expected if they only had a precise usage. If confirmed, this more 
frequent appearance is taken as support for the claim that round numbers, in 
addition to denoting their own values, are used imprecisely due to context (e.g., 
when imprecision prevails over precision, or when the speaker is uncertain about 
the actual precise values). Their additional use for this purpose would explain 
the prevalence of round numbers throughout natural language data. Furthermore, 
the analysis has been conducted to shed light on the distribution of approximator 
(null/precise/imprecise), numeral (round/non-round) and unit (discrete/continuous), 
as well as possible patterns in their conjoint appearance. 

Based on the theoretical considerations in Sect. 2, we started with the following
hypotheses, where $\pi_i$ denotes the probability of the number $i$ occurring in natural
language communication:


(9) H0: $\pi_1 = \pi_2 = \dots = \pi_{500}$
H1: $\pi_1 \neq \pi_2 \neq \dots \neq \pi_{500}$


In the null hypothesis H0, each numeral is assumed to appear with an equal probability
in the corpus. The corpus study restricts numerical analysis to numerals in the range 
between 1 and 500, hence the notation above. Say the probability of appearance of 
each numeral is 1/500, then we expect round numbers (i.e., numbers ending with a 0 
or 5) to appear 20 percent of the time (100 out of the 500 numbers are round) whereas 
non-round numbers should appear 80 percent of the time (the remaining 400 out of 
500 numbers are non-round). 

Our first hypothesis is captured in H1: We expect that the probability of
appearance is not equal for every numeral. More specifically, related to H1, we
assume that round numbers appear more often than expected (i.e., >20%).

Secondly, we assume that the default interpretation of numerals in general is 
precise, following Ferson et al. (2015) and the findings in their study (and contrary 
to Krifka 2007). As a consequence, a precise interpretation often does not have to be 
signaled explicitly whereas imprecise approximators are needed to signal an intended 
imprecise interpretation. Thus, our second hypothesis is that precise approximators 
appear less frequently than imprecise approximators. 

Thirdly, in terms of combinations of approximators and numerals, let us recall 
example (3b) or (10a) from Sauerland and Stateva (2011), which they take to be 
odd. Since imprecise approximators usually signal a coarse granularity level, the 
appearance with a non-round number (which only appears on more fine-grained 
scales) strikes the reader as peculiar. We will therefore expect that imprecise 
approximators tend to appear with round numbers. 


(10) a. # What John cooked were approximately 49 tapas. 
b. The rope is approximately 49 metres long. 


Furthermore, theoretical accounts so far mainly focused on the interaction between 
approximators and numerals. Ferson et al. (2015) examined a potential influence 
of the unit on the interpreted imprecision of a numeral, a hypothesis that was not 
supported by the results of their study. To our knowledge, little attention has been 
paid to the potential interaction between unit, approximator, and numeral, see for 
instance, (10b). Whereas (10a) is odd to the reader, this oddity disappears in (10b), 
which is completely natural. This can be attributed to the fact that the continuous unit 
implies that 49 m can already be used imprecisely (49 is round compared to 48.7) 
whereas this is not the case for discrete numbers (49 is the most precise possible in 
this case and has no imprecise reading). The results of the corpus study will also be 
inspected with respect to this effect. 


3.2 Methods 


The study was based on the Leipzig Wortschatz corpus (Goldhahn et al. 2012), 
containing 1 million English sentences sourced from online news reports and 
general web crawling results. The corpus was searched for numerical expressions 
in the Approximator-Number-Unit fashion. The code was written in Python and
is publicly available online (https://github.com/Ibechberger/CorpusStudyNumerals). 
The matches were analyzed with respect to the following variables: 


(11) Variables 
a. Approximator: precise, imprecise, null 
b. Number: round, non-round 
c. Unit: discrete, continuous 


Counts kept track of the different combinations. Numbers were counted as round if
they ended with a zero or five; we only used integer numbers (excluding decimal 
numbers) in the analysis. We only included number words up to five hundred in the 
counts. The categories for the approximator matched for the following words: 


(12) Categories of approximators (Approx.)
a. Precise Approx.: [‘exactly’, ‘precisely’]
b. Imprecise Approx.: [‘about’, ‘approximately’, ‘roughly’, ‘around’, ‘round
about’, ‘roughly around’, ‘some’³]
c. Asymmetrical Approx.: [‘more than’, ‘nearly’, ‘over’, ‘almost’,
‘approaching’, ‘below’, ‘above’, ‘fewer than’, ‘less than’, ‘at most’, ‘at least’,
‘close to’, ‘near to’, ‘up to’, ‘as high as’, ‘as low as’, ‘not quite’]
d. Null Approx.: every expression preceding a numeral that does not match
the words above
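

The matching of approximator and numeral can be approximated with a regular expression. The following is a minimal sketch, assuming digit-only integers and the precise and imprecise word lists from (12); it is not the code from the repository mentioned above, and the names `PATTERN` and `classify` are hypothetical. Number words, asymmetrical approximators, and the WordNet-based unit classification are left out.

```python
import re

PRECISE = ["exactly", "precisely"]
IMPRECISE = ["about", "approximately", "roughly", "around",
             "round about", "roughly around", "some"]

# Longest alternatives first, so that "roughly around" wins over "roughly".
_ALTERNATIVES = "|".join(
    re.escape(word)
    for word in sorted(PRECISE + IMPRECISE, key=len, reverse=True)
)

# Optional approximator, then an integer of up to three digits that is not
# part of a longer number or a decimal, then the following word as the unit.
PATTERN = re.compile(
    rf"(?:\b(?P<approx>{_ALTERNATIVES})\s+)?"
    r"(?<![\w.,])(?P<number>\d{1,3})\s+(?P<unit>[A-Za-z]+)",
    re.IGNORECASE,
)


def classify(sentence):
    """Return (approximator class, roundness, unit word) for each match."""
    results = []
    for match in PATTERN.finditer(sentence):
        number = int(match.group("number"))
        if not 1 <= number <= 500:  # range used in the corpus study
            continue
        approx = (match.group("approx") or "").lower()
        if approx in PRECISE:
            approx_class = "precise"
        elif approx in IMPRECISE:
            approx_class = "imprecise"
        else:
            approx_class = "null"
        roundness = "round" if number % 5 == 0 else "non-round"
        results.append((approx_class, roundness, match.group("unit").lower()))
    return results


example = classify("Brigham City is about 60 miles north of Salt Lake City.")
```

The unit word returned here would still have to be mapped to the discrete or continuous category in a separate step.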


Asymmetrical approximators (based on the list of approximators used in Ferson et al.’s
(2015) study) were not included in the statistical analysis. Yet, they were also
matched to obtain an estimate of the frequency of their usage and have a more accurate 
account of the unmodified versus modified numerals ratio. Their appearance with 
either round or non-round numbers was neither recorded nor analyzed (although 
asymmetrical approximators, a.k.a. comparatives, are also a subject of debate in 
current accounts of imprecision (Solt 2014)). The unit was first matched as any 
word following the numeral and subsequently evaluated using WordNet (Princeton 
University 2010) for whether it belonged to one of the following categories: 


(13) Categories of units
a. continuous: [‘time period’, ‘time unit’, ‘linear unit’, ‘magnitude relation’,
‘monetary unit’, ‘unit of measurement’]
b. discrete: [‘organism’, ‘human activity’, ‘group’, ‘location’, ‘transport’,
‘material’]


All numerals occurring with matches that did not belong to any of the categories have 
been excluded; the remaining matches were used for the analysis. The data conse- 
quently had the nature of frequency counts of the aforementioned (Approximator)- 
Number-Unit sequences and of the respective counts of approximators, numerals, 
and units separately. The analysis consisted of testing the match counts against their 
expected frequency: The main hypothesis, that the frequency of round numbers 
is different from their expected frequency, was tested for significance using the 
Binomial Test. The effects in the Number (roundness) * Approximator and Unit 
* Approximator contingency tables were tested using the χ² test.


³ Such as Some 50 students joined the protest.


Fig. 1 Numeral counts from 0 to 100 (frequency of numbers and number words)


Fig. 2 Numeral counts from 0 to 50 (frequency of numbers and number words)


3.3 Results and Interpretation 


As can be seen from Figs. 1 and 2, there are “spikes” of counts for round numbers
(also visible in the range between 0 and 100) already indicating a marked appearance 
of round numerals in the corpus. The general distribution (a few numbers with very 
high counts and a tail to the right) suggests that numeral occurrences seem to follow 
a power law distribution, specifically one related to Benford’s law (Benford 1939). 
The extraordinarily high count for the numeral 1 can be explained by the frequent
usage of the number word in many contexts (e.g., ‘He had one goal.’, ‘A government 
has the energy for only so many fights at one time.’, etc.). 


Table 1 Frequency counts of matches in the corpus: Approximator * Number (roundness) * Unit

n = 182,895   Discrete                 Continuous
              Round      Non-round     Round      Non-round
Precise       4          7             15         36           (62)
Imprecise     2,975      423           3,217      2,081        (8,696)
Null          21,215     64,990        30,535     57,397       (174,137)
              (24,194)   (65,420)      (33,767)   (59,514)


Table 2 Frequency counts of matches in the corpus: Number (roundness) * Approximator

n = 182,895   Precise   Imprecise   Null
Round         19        6,192       51,750      (57,961)
Non-round     43        2,504       122,387     (124,934)
              (62)      (8,696)     (174,137)


182,895 of the matched numerals were used for the analysis (another 369,384 in 
that range were discarded due to unit constraints). The null hypothesis thus expects 
36,579 of these numerals to be round and 146,316 numerals to be non-round. 
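
The expected counts under the null hypothesis follow from simple arithmetic, which the following Python sketch spells out. It assumes the roundness criterion used in this study (numbers ending in 0 or 5) and the range 1 to 500; the variable names are illustrative.

```python
LOWER, UPPER = 1, 500   # range of numerals considered in the corpus study
N_MATCHES = 182_895     # matched numerals used for the analysis


def is_round(n):
    """Roundness criterion of the study: the number ends in 0 or 5."""
    return n % 5 == 0


numerals = range(LOWER, UPPER + 1)
round_share = sum(is_round(n) for n in numerals) / len(numerals)

# Expected counts if every numeral were equally probable.
expected_round = round(round_share * N_MATCHES)
expected_non_round = N_MATCHES - expected_round
```

A round share of 0.2 applied to the 182,895 matches yields the expected counts stated above.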

Generally, as in Table 1, we observe the following tendencies: First, non-round 
numbers appear, in absolute terms, more often than round numbers. Second, unmod- 
ified numerals appear most frequently with a count of 174,137, followed by numerals 
modified by an imprecise approximator (8,696 counts) and lastly, numerals modified 
by precise approximators (62 matches). Third, numerals with discrete units (89,614 
counts) appear almost as often as numerals with continuous units (93,281 counts), 
with a ratio of approximately 0.49/0.51. 

More specifically, our findings are stated as follows, see Table 2. First, round 
numbers appear more frequently than expected. As we can read from the tables, 
a total of 57,961 (as opposed to the expected 36,579) round numbers and a total 
of 124,934 (as opposed to the expected 146,316) non-round numbers were counted. 
Instead of an expected 0.2/0.8 ratio, we found a ratio of approximately 0.32/0.68. This 
effect is particularly pronounced if the numerals appear with a continuous unit—the 
ratio between round and non-round numbers is roughly 0.36/0.64 there. Binomial 
testing reveals that this is a significant departure from the expected frequency (p < 
0.01, one-sided). 
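
The size of this departure can be checked with a short calculation. The sketch below uses a normal approximation to the one-sided binomial test, which is an assumption made here to keep the example free of external libraries; it is not necessarily the exact procedure used in the study. The counts are taken from Table 2.

```python
import math


def one_sided_binomial_z(successes, trials, p0):
    """Normal approximation to a one-sided (upper-tail) binomial test.

    Returns the z-score and the approximate p-value for observing at
    least `successes` hits in `trials` draws with success probability p0.
    """
    expected = trials * p0
    sd = math.sqrt(trials * p0 * (1 - p0))
    z = (successes - expected) / sd
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value


ROUND_OBSERVED = 57_961   # round numerals in Table 2
TOTAL = 182_895           # all matched numerals
P_ROUND_H0 = 0.2          # share of round numerals expected under H0

observed_share = ROUND_OBSERVED / TOTAL
z_score, p_value = one_sided_binomial_z(ROUND_OBSERVED, TOTAL, P_ROUND_H0)
```

The observed share of round numerals lies far above the expected 0.2, so the approximate p-value falls well below the 0.01 threshold.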

Second, imprecise approximators appear more frequently than precise approximators.
Table 3 shows a total count of 8,696 imprecisely modified numerals as opposed
to a mere 62 occurrences of precisely modified numerals in the given range. This
supports our assumption that the default interpretation of numerals is
precise, which makes imprecise approximators an important tool to signal that the
imprecise interpretation is intended, whereas precise approximators are unnecessary
most of the time.


Table 3 Frequency counts of matches in the corpus: Unit * Approximator

n = 182,895   Precise   Imprecise   Null
Discrete      11        3,398       86,205      (89,614)
Continuous    51        5,298       87,932      (93,281)
              (62)      (8,696)     (174,137)


Table 4 Breakdown of Table 1 with respect to Unit

              Discrete                 Continuous
              Round      Non-round     Round      Non-round
Imprecise     2,975      423           3,217      2,081        (8,696)


Third, imprecise approximators tend to appear with round numbers, especially if
the unit is discrete. This is one of the most impressive results from the study: Even
though generally and in absolute terms, non-round numbers occur more often than
round numbers, we can read from Table 2 that if numerals occur with an imprecise
approximator, the proportions are almost swapped. Even in absolute terms,
imprecisely modified round numerals occur more often than imprecisely modified
non-round numerals. This represents strong evidence for our hypothesis that imprecise
approximators predominantly appear with round numbers. The deviations from
the expected frequencies in Table 2 were significant using the χ² test, i.e., χ²(df
= 2, n = 182,895) = 6585.259, p < 0.01, φc = 0.19. Conversely, this finding
can be framed in terms of the infrequent appearance of imprecise approximators
with non-round numbers (see Sauerland and Stateva’s (2011) oddity example (10a)
mentioned above). Arguably, 2,504 occurrences of non-round numerals appearing with
imprecise approximators is still a substantial count. Resolution, however, comes from
looking at Table 4, where a further breakdown of the data with respect to the unit
category is presented.

We see that this effect is particularly strong if we are looking at the discrete domain: 
There were 2,975 occurrences of the imprecise approximator-round numeral combination,
whereas only 423 non-round numerals appeared with an imprecise approximator
there (an impressive ratio of roughly 0.88/0.12). This is in line with Sauerland
and Stateva’s (2011) observation about the oddity of imprecise approximators occurring
with non-round numbers. In contrast, in the continuous domain, this effect vanishes for the
most part (compare (10b)). This is also reflected in our counts: Although it is still 
the case that imprecise approximators occur more often with round numbers in this 
condition, the count for imprecise approximators appearing with non-round numbers 
is almost equally high and in absolute terms not negligible. This indicates that the 
oddity of imprecise approximators appearing with non-round numbers is drastically 
reduced if these numbers are continuous. We have thus encountered evidence for the 
claim that the unit has an effect on the co-occurrence behavior of approximators and 
numerals. 
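
The statistics reported for this third finding can be recomputed from the published counts. The following Python sketch implements Pearson’s χ² statistic and Cramér’s V by hand for Table 2 and derives the share of round numerals among imprecisely modified numerals from Table 4. It is a verification sketch under the assumption that the reported effect size is Cramér’s V; the function and variable names are illustrative.

```python
import math

# Table 2: rows are round / non-round; columns are precise, imprecise, null.
TABLE_2 = [
    [19, 6_192, 51_750],
    [43, 2_504, 122_387],
]

# Table 4: imprecisely modified numerals as (round, non-round) per unit.
TABLE_4 = {"discrete": (2_975, 423), "continuous": (3_217, 2_081)}


def pearson_chi_square(rows):
    """Pearson's chi-square statistic for a contingency table."""
    row_totals = [sum(row) for row in rows]
    col_totals = [sum(col) for col in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(rows):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    return chi2, n


chi2, n = pearson_chi_square(TABLE_2)
n_rows, n_cols = len(TABLE_2), len(TABLE_2[0])
dof = (n_rows - 1) * (n_cols - 1)
cramers_v = math.sqrt(chi2 / (n * min(n_rows - 1, n_cols - 1)))

# Share of round numerals among imprecisely modified numerals, per unit.
round_share = {
    unit: rounds / (rounds + non_rounds)
    for unit, (rounds, non_rounds) in TABLE_4.items()
}
```

The recomputed share of round numerals is about 0.88 for discrete units and about 0.61 for continuous units, which illustrates how much weaker the preference for round numbers is in the continuous domain.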

Last but not least, precise approximators tend to appear with continuous units. We
see that if precise approximators appear at all, they tend to co-occur with continuous 
units (51 occurrences with continuous quantities vs. 11 occurrences with discrete 
quantities, see boldfaced numbers in Table 3). This makes sense to the extent that 
for continuous quantities, the precise interpretation is not trivial. These observations, 
however, should be taken with a grain of salt as we did not have many occurrences 
of precise approximators overall. 


(14) a. Trump announced his candidacy for the Republican nomination exactly 
three months ago. 
b. Belgium’s federal prosecutor’s office says authorities have so far made 
(?exactly) three arrests linked to the deadly attacks in Paris.


While exactly adds nothing to the already precise interpretation of (14b), in (14a),
it makes a contribution to the interpretation of the numeral. Since the numeral
in (14a) can never describe the actual time span between Trump’s announcement
and the report with complete accuracy, the degree of accuracy needs to be marked
explicitly to indicate “how precisely” the expression is meant. In (14a), one can
assume that the speaker intended an interpretation accurate to the day (i.e., the report
was made on the same date of the third subsequent month). Unless the numeral is of
special interest, exactly in (14b), in contrast, appears redundant.


4 Psycholinguistic Experiment 


To investigate the effect of the unit on the acceptability of numerical expressions, we 
tested English numeral expressions using a 2 × 2 factorial design, with the factors
Number (round vs. non-round) and Unit (discrete vs. continuous). 


4.1 Materials and Predictions 


We used 24 different matrix sentence items, each in four conditions; see the Appendix
for the entire list of the items. The experimental items were constructed with the
following objective: The aim was to choose sentences containing the sequence
imprecise approximator + round number, which, as motivated above, should evoke no
perception of oddity. The sentences were picked from the Leipzig Wortschatz corpus.
Before selecting the sentences, we determined the round numbers that they ought 
to contain. For this, 12 round numbers were randomly chosen in the range from 10 
to 1000. This yielded the numbers 10, 60, 70, 100, 350, 400, 700, 750, 800, 900, 
950, 1000. We then scanned the corpus for sentences containing imprecise approx- 
imators (about, around, approximately and roughly, six occurrences each) and the 
randomly chosen round numbers that would appear with either discrete or continuous 
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units, resulting in equally many sentences for both the ‘discrete’ and the ‘continuous’ 
condition. 

Based on the experimental items for the round conditions, we created their non- 
round counterparts by changing the round number of each sentence into a close-by 
non-round one. This way we ensured that the non-round number would appear in 
a plausible context and linguistic environment. The oddity could thus only arise 
from the pairing of a non-round number with an imprecise approximator. The four 
conditions are exemplified in (15). 


(15) a. r-disc: As of then, about 60 Cubans had arrived in the Yucatan coast in 
2015. 
b. r-cont: Brigham City is about 60 miles north of Salt Lake City. 
c. nr-disc: As of then, about 61 Cubans had arrived in the Yucatan coast in 
2015. 
d. nr-cont: Brigham City is about 61 miles north of Salt Lake City. 


Additionally, we used 48 filler items as distractors, which were news report sentences 
of comparable length that we also sourced from the Leipzig Wortschatz corpus. We 
did not revise these, as the pragmatic difference in focus is subtle and thus fillers
containing ungrammatical or odd phrases would be inappropriate.


(16) The drug investigation began in August 2013 at Edwards Air Force Base in 
California. 


Based on Sauerland and Stateva’s (2011) observation that non-round numbers are 
odd with imprecise approximators and our corpus-linguistic finding that this effect is 
stronger with discrete units than continuous units, we had the following predictions: 
First, there will be a main effect of Number. More specifically, the condition “nr-disc” 
will be rated worse than “r-disc”, and the condition “nr-cont” will be rated worse than
“r-cont”. These predictions are in accordance with the oddity suggested by Sauerland 
and Stateva. Due to the observations made in the corpus study, we included a second 
prediction, namely, there would be a main effect of Unit and possibly an interaction 
between Number and Unit due to a stronger worsening effect with discrete than with 
continuous units. 

We used a Latin square design; that is, each participant read one list of 72 sentences
in total (24 experimental items, each in one of the four conditions, plus the 48
fillers). As seen above, the participants’ attention was directed towards the phrases
of interest by marking the relevant phrase visually, both in experimental and in
filler items.⁴ For the filler items, the marked phrases were mostly DPs or PPs (i.e.,
determiner phrases or prepositional phrases).
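
The list construction implied by a Latin square design can be sketched as follows.
This is a minimal illustration and not the actual Ibex Farm configuration of the
experiment: the function name latin_square_lists and the rotation rule are
assumptions made for the example. The sketch rotates the four conditions across
the 24 items so that each list contains every item exactly once and six items per
condition; adding the 48 fillers gives the 72 sentences read by each participant.

```python
from collections import Counter

CONDITIONS = ["r-disc", "r-cont", "nr-disc", "nr-cont"]
N_ITEMS = 24      # experimental items, as reported in the text
N_FILLERS = 48    # filler items, as reported in the text


def latin_square_lists(n_items=N_ITEMS, conditions=CONDITIONS):
    """Return one presentation list per rotation of the conditions.

    List k shows item i in condition (i + k) mod 4, so every item occurs
    once per list and in each condition exactly once across the lists.
    The rotation rule is a hypothetical choice made for illustration.
    """
    n_cond = len(conditions)
    return [
        [(item, conditions[(item + k) % n_cond])
         for item in range(1, n_items + 1)]
        for k in range(n_cond)
    ]


lists = latin_square_lists()
for k, lst in enumerate(lists, start=1):
    per_condition = Counter(cond for _, cond in lst)
    total = len(lst) + N_FILLERS
    print(f"list {k}: {dict(per_condition)}, sentences incl. fillers: {total}")
```

With four lists and 72 participants, an even assignment would place 18 participants
on each list; whether the assignment was in fact even is not stated in the text.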


⁴ We highlighted the critical phrases because, in a pretest without highlighting, the
subjects did not distinguish between the conditions, which left open whether there was
truly no effect of unit or whether the null result was due to methodological issues.


4.2 Procedure and Participants


The experiment was set up with Ibex Farm (spellout.net/ibexfarm/), a website that
provides free hosting for online psycholinguistic experiments. Experimental data
were gathered using Amazon MTurk, a crowdsourcing platform where human
intelligence tasks (HITs) can be carried out by participants who receive compensation
for each HIT completed. Workers were provided with the link to the experiment and
compensated with $4. Native English-speaking workers on Amazon MTurk (N =
72) signed informed consent and participated in the study.

Before entering the experimental phase, participants first completed a practice 
session where 12 practice items were to be rated. During the experimental phase, 
they first read an entire sentence and then were asked to rate the naturalness of the 
underlined phrases (which were shown again separately) on a 7-point Likert scale (1
= unnatural, 7 = natural). 


4.3 Data Analysis and Results 


The descriptive statistics are provided in Table 5 and visualized in Fig. 3. As can
be seen in the table, descriptively, the “r-cont” condition received the highest mean 
rating, whereas the “nr-disc” condition received the lowest mean rating. The standard 
deviation was also highest for the “nr-disc” condition, indicating an overall lower 
consistency in ratings for this condition. 

Table 5 Mean naturalness ratings (1 = unnatural, 7 = natural), standard deviations (SDs), and standard errors (SEs)

Conditions   Mean   SD     SE
r-disc       5.82   1.47   0.07
r-cont       5.90   1.39   0.07
nr-disc      5.11   1.77   0.09
nr-cont      5.33   1.70   0.09

Fig. 3 Naturalness ratings of the experiment (y-axis: naturalness rating, 1 to 7; x-axis: discrete unit vs. continuous unit; bars: round number vs. non-round number)

We analyzed the data using R. All analyses were performed using mixed effects
linear regression models; the models were constructed using the lme4 package in R
(Baayen et al. 2008; Bates et al. 2012). All contrasts of interest were sum coded and
included as fixed effects in the model. The reported model is the maximal model that
converged. The model included Number and Unit (with interaction term) as fixed
effects. Furthermore, we included random intercepts for subjects, items, and stimulus
order, as well as random by-subject slopes for the effects of Number and Unit (and
their interaction).

We found a significant main effect of Number (t = 4.15, p < 0.0001). Tukey’s
HSD for multiple comparisons of means indicates that round numbers were rated
significantly more natural than non-round numbers with both continuous (t = 3.96, p
< 0.005) and discrete (t = 3.84, p < 0.005) units. Furthermore, we found a significant
effect of Unit (t = 2.11, p < 0.05) in that continuous conditions were rated better than
discrete conditions. However, there was no significant interaction between the two
factors, which suggests that the effect of neither factor depends on the level of
the other.
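
As a descriptive cross-check on the direction of these effects, the cell means in
Table 5 can be decomposed into a grand mean, the two main-effect differences, and
an interaction contrast. The sketch below is only arithmetic on the reported means;
it is not the mixed-effects model described above, and the variable and function
names are illustrative assumptions.

```python
# Cell means copied from Table 5.
means = {
    ("round", "discrete"): 5.82,
    ("round", "continuous"): 5.90,
    ("non-round", "discrete"): 5.11,
    ("non-round", "continuous"): 5.33,
}


def marginal(factor_index, level):
    """Average the cell means over the other factor (hypothetical helper)."""
    cells = [m for key, m in means.items() if key[factor_index] == level]
    return sum(cells) / len(cells)


grand_mean = sum(means.values()) / len(means)
# Main-effect differences between marginal means.
number_effect = marginal(0, "round") - marginal(0, "non-round")
unit_effect = marginal(1, "continuous") - marginal(1, "discrete")
# Interaction contrast: Unit difference for non-round minus that for round.
interaction = (
    (means[("non-round", "continuous")] - means[("non-round", "discrete")])
    - (means[("round", "continuous")] - means[("round", "discrete")])
)

print(f"grand mean:           {grand_mean:.2f}")
print(f"round - non-round:    {number_effect:.2f}")
print(f"continuous - discrete: {unit_effect:.2f}")
print(f"interaction contrast: {interaction:.2f}")
```

On these means, the Number difference (about 0.64 scale points) is much larger than
the Unit difference (about 0.15). The interaction contrast (about 0.14) points in
the direction of a stronger worsening with discrete units, but the reported model
did not find it to be significant.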

In this study, we were able to confirm our first predictions about the effect of 
Number and Unit. We will leave the reason for the lack of an interaction for future 
studies. 


5 Discussion and Conclusion 


In this paper, we tried to gain insight into our understanding and interpretation 
of numerical expressions with regard to questions such as whether numbers are 
imprecise at the semantic level. 


5.1 Numbers and Number Concepts 


We must keep in mind that the development of the number system as we know it 
now has been a process of cultural construction and added knowledge over genera- 
tions and centuries of historical time. When analyzing how we interpret numerical 
expressions in natural language contexts, insight might be provided by looking at the 
innate numerical concepts humans (and non-human animals) are equipped with for 
reasoning quantitatively. 

Our understanding of number proceeds from concepts that do not conform to the
structure and characteristics of the natural numbers (Rips et al. 2007). Two main
mechanisms for quantitative reasoning have been identified for numerical ability
in infants and non-human animals. On the one hand, one system works with internal
analog magnitudes (perhaps some type of continuous strength or activation) that
are a linear function of the input. On the other hand, infants’ skills for quantitative
reasoning may also draw on discrete and distinct representations of objects that are
kept in short-term memory; however, fewer than four items can be represented
this way.

Briefly put, a mental (i.e., internal analog) magnitude is an internal representation
of a quantity; this can be the cardinality of a set, but also the duration, length, or
volume of whatever is registered by the organism. What is special about this
representation is that it is assumed to represent an objective magnitude in a direct
linear relationship, in that it constitutes a mentally represented continuous quantity
(e.g., activation strength) that adjusts to achieve a measure of a quantity. It is thus
suggested that mental magnitudes share the formal properties of real numbers
(Gallistel and Gelman 2005). However, analog magnitude representations are noisy,
and the noise increases linearly with the size of the quantity. That is, the bigger the
measured values, the more imprecise the representation.⁵ Analog magnitude
representations of large sets are thus only approximate; they are a coarse
representation, contrasting with the precision associated with natural numbers.
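
The consequence of noise that grows linearly with magnitude can be illustrated with
a small sketch. It assumes a common scalar-variability formulation in which each
magnitude n is represented as a Gaussian with standard deviation w * n; the Weber
fraction w = 0.15 and the function name are illustrative assumptions, not values
taken from the studies cited here.

```python
import math


def discriminability(a, b, w=0.15):
    """Signal-to-noise ratio for telling magnitudes a and b apart.

    Assumes independent Gaussian noise with SD = w * magnitude on each
    representation (scalar variability); w is a hypothetical Weber fraction.
    """
    noise = w * math.sqrt(a ** 2 + b ** 2)
    return abs(a - b) / noise


# Same absolute difference (1) at small vs. large magnitudes,
# then the same ratio (1:2) at small vs. large magnitudes.
for a, b in [(10, 11), (100, 101), (10, 20), (100, 200)]:
    print(f"{a} vs. {b}: {discriminability(a, b):.2f}")
```

Under these assumptions, a fixed absolute difference becomes harder to detect as
the magnitudes grow, whereas pairs with the same ratio remain equally
discriminable, which is the pattern described by Weber’s law.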

The other mechanism underlies an infant’s ability to predict the total number of
objects in small sets (fewer than four) and might be considered conceptually closer to
the elaborate concept we have of integers. It depends on attentional or short-term
memory mechanisms that represent individual objects as distinct entities. For each
object, there is a distinct representation within the four-object capacity limit. A set
exceeding three items cannot be held in the infants’ short-term memory (Carey 2004).

Many psychologists believe that full-fledged mathematical thinking mainly originates
from these two innate concepts, which have also been shown to exist in non-human
animals. Although other researchers argue that these abilities do not seem to be
adequate prerequisites for forming the mathematical concept we have of numbers
within a number system (see Rips et al. 2007 for a discussion of this issue), they
have still been shown to be relevant in quantitative and even arithmetical reasoning
(Gallistel and Gelman 2005). Specifically, analog magnitudes have been shown to play
a role in arithmetical computations: comparison of two values, and also addition and
subtraction (Carey 2004). Indeed, if analog magnitude representations are used in
mathematical contexts, which would above all require high-precision representations,
it is likely that they are also employed when encountering numerical
expressions in a natural language context.

How, if at all, do these mechanisms play into the interpretation of numerical
expressions in natural language? Krifka (2007) argues that the
existence of these two distinct systems of representation provides plausibility for
both an exact and an approximate interpretation of numerals since they work in
parallel and are not hierarchically ordered in any way. Which one of the two is the
“original” meaning of a numeral is not settled by this argumentation; it might even be
that there is none and that both interpretations are equally prevalent. None of the
findings in developmental research, however, comprise or imply an inherent distinction
between round and non-round numbers with respect to imprecision. Thus, at least the
imprecise interpretation of round numbers (not the general imprecise representation
of quantities) seems to be a phenomenon “on top of” the basic interpretation of
numerals, which likely only started to develop after the formation of more elaborate
mathematical systems.


⁵ In this, mental magnitudes follow Weber’s law, according to which the discriminability of two
values is a function of their ratio: the bigger the physical magnitudes (and consequently the analog
magnitudes), the harder it becomes to discriminate between pairs of values that are separated by the
same absolute difference.

⁶ Short-term object representations do have the discreteness of natural numbers; however, they do
not form a set representation of the tracked objects and, consequently, cannot represent cardinality.
Cardinality, in turn, is represented by mental magnitudes.


5.2 Contributions and Outlooks of the Current Study 


In the current study, we provide a critical discussion of numerical expressions based 
on the recent formal (compositional) semantic literature, focusing on the impre- 
cise and precise interpretation of numerical expressions. While the interpretation of 
numerical expressions depends on both broad discourse context and narrow linguistic 
context, we only dealt with the latter. Our corpus and experimental studies show that 
the interpretation of numerical expressions is subject to the kind of numbers, the kind 
of units, as well as whether and what approximators co-occur with them. 

It should be noted that the results we obtained in our study are certainly contingent
on, for example, the specific corpus study or experimental design, the specific numerals
(i.e., 0-500) we used, and the specific contexts in which they occurred (in our case,
naturally occurring contexts instead of made-up contexts as in usual experimental
work). Whether and to what extent they apply to numerical expressions in general
thus needs to be investigated in further studies. Furthermore, approximators might
differ among themselves. For example, even within the imprecise category, roughly
and some as in some 50 people might have syntactic, semantic, or pragmatic
differences, which we were not able to handle here. The same holds for Unit, which
might differ in terms of aspects other than discreteness. Another question for future
studies is how the interpretation of numerical expressions is affected by broad context
(such as QUD, decision problems, developmental or individual differences, purely
information-exchanging vs. strategic communication, and counting vs. measuring
contexts, to name just a few parameters). Despite this, we believe that the method and
the findings of the paper take further steps towards understanding numerical concepts
and the related concepts that they modify.


Appendix: Test Items of the Experiment
(I./C. for Item/Condition)


I./C. Sentence

1/1 People from about 10 municipalities around the wrecked plant still cannot go home.
1/2 About 10 hours after the terror attack, someone approached police officers near a synagogue and started shooting.
1/3 People from about 9 municipalities around the wrecked plant still cannot go home.
1/4 About 9 hours after the terror attack, someone approached police officers near a synagogue and started shooting.

2/1 As of then, approximately 60 Cubans had arrived in the Yucatan coast in 2015.
2/2 Brigham City is approximately 60 miles north of Salt Lake City.
2/3 As of then, approximately 61 Cubans had arrived in the Yucatan coast in 2015.
2/4 Brigham City is approximately 61 miles north of Salt Lake City.

3/1 Reportedly, around 70 members of the security forces have been killed in attacks blamed on the PKK.
3/2 BBC Travel says it is currently taking around 70 minutes to get through the traffic.
3/3 Reportedly, around 69 members of the security forces have been killed in attacks blamed on the PKK.
3/4 BBC Travel says it is currently taking around 69 minutes to get through the traffic.

4/1 Experts say present-day cars have about 100 computers on board, and that will double in a few years.
4/2 The helicopter was about 100 feet off the ground when it went out of control and smashed into a parking lot.
4/3 Experts say present-day cars have about 101 computers on board, and that will double in a few years.
4/4 The helicopter was about 101 feet off the ground when it went out of control and smashed into a parking lot.

5/1 The tornado hit on Monday morning, leaving around 350 homes destroyed.
5/2 A hard-charging fire spread across around 350 acres in a matter of hours, the county fire department said.
5/3 The tornado hit on Monday morning, leaving around 349 homes destroyed.
5/4 A hard-charging fire spread across around 349 acres in a matter of hours, the county fire department said.

6/1 In total, about 400 competitors are entered in this year’s Dakar Rally, in cars, trucks, motorbikes and quad bikes.
6/2 The Dutch designed Guyana’s drainage system about 400 years ago.
6/3 In total, about 401 competitors are entered in this year’s Dakar Rally, in cars, trucks, motorbikes and quad bikes.
6/4 The Dutch designed Guyana’s drainage system about 401 years ago.

7/1 Nigerian armed forces attacked Boko Haram’s headquarters and killed about 700 people, including its leader.
7/2 It was confirmed that there was about 700 tons of the deadly chemical stored at the warehouse that blew up.
7/3 Nigerian armed forces attacked Boko Haram’s headquarters and killed about 699 people, including its leader.
7/4 It was confirmed that there was about 699 tons of the deadly chemical stored at the warehouse that blew up.

8/1 The run is taking place in approximately 750 communities across Canada.
8/2 The storm was currently situated approximately 750 miles east of the Leeward Islands, the National Hurricane Center said.
8/3 The run is taking place in approximately 751 communities across Canada.
8/4 The storm was currently situated approximately 751 miles east of the Leeward Islands, the National Hurricane Center said.

9/1 Russian military and security forces have an inventory of approximately 800 drones, all believed unarmed.
9/2 South Indian Lake is approximately 800 kilometres north of Winnipeg.
9/3 Russian military and security forces have an inventory of approximately 799 drones, all believed unarmed.
9/4 South Indian Lake is approximately 799 kilometres north of Winnipeg.

10/1 Last year in Manitoba approximately 900 people were diagnosed with colorectal cancer.
10/2 The fire, which is being driven by the wind, is approximately 900 acres in size.
10/3 Last year in Manitoba approximately 901 people were diagnosed with colorectal cancer.
10/4 The fire, which is being driven by the wind, is approximately 901 acres in size.

11/1 The study included an analysis of roughly 950 papers and reports on fracking and water supplies.
11/2 The bus plunged near Joinville, roughly 950 kilometres southwest of Rio de Janeiro.
11/3 The study included an analysis of roughly 949 papers and reports on fracking and water supplies.
11/4 The bus plunged near Joinville, roughly 950 kilometres southwest of Rio de Janeiro.

12/1 Roughly 1,000 residents in the area of a resort town remain under evacuation orders.
12/2 The plane in which he was flying crashed in the Atlantic Ocean, roughly 1,000 miles off course.
12/3 Roughly 1,001 residents in the area of a resort town remain under evacuation orders.
12/4 The plane in which he was flying crashed in the Atlantic Ocean, roughly 1,001 miles off course.

13/1 Authorities then tracked down about 10 men who’d previously been questioned and asked for DNA samples.
13/2 The two car bombs in Baghdad went off about 10 minutes apart late Saturday in the Karrada district.
13/3 Authorities then tracked down about 11 men who’d previously been questioned and asked for DNA samples.
13/4 The two car bombs in Baghdad went off about 11 minutes apart late Saturday in the Karrada district.

14/1 Around 60 firefighters rushed to Adair Tower in North Kensington after a blaze in a two-room apartment.
14/2 According to the defense official, Vietnam has reclaimed around 60 acres of land since 2009.
14/3 Around 59 firefighters rushed to Adair Tower in North Kensington after a blaze in a two-room apartment.
14/4 According to the defense official, Vietnam has reclaimed around 59 acres of land since 2009.

15/1 Approximately 70 people have left Norway to become foreign fighters in Syria or Iraq.
15/2 Approximately 70 miles southeast of the camping resort, a tornado raked Coal City and damaged several subdivisions.
15/3 Approximately 71 people have left Norway to become foreign fighters in Syria or Iraq.
15/4 Approximately 71 miles southeast of the camping resort, a tornado raked Coal City and damaged several subdivisions.

16/1 Around 100 buildings in the city have been flooded, prompting calls for aid from residents.
16/2 Temperatures hovered around 100 degrees, raising concerns of wildfire.
16/3 Around 99 buildings in the city have been flooded, prompting calls for aid from residents.
16/4 Temperatures hovered around 99 degrees, raising concerns of wildfire.

17/1 Officials asked that the roughly 350 people who live in the area known as Morgan’s Point remain inside their homes.
17/2 The mine is located roughly 350 kilometres from Mali’s main northern city of Gao.
17/3 Officials asked that the roughly 351 people who live in the area known as Morgan’s Point remain inside their homes.
17/4 The mine is located roughly 351 kilometres from Mali’s main northern city of Gao.

18/1 This year’s conference attracted roughly 400 activists from countries including China, Thailand, and Nepal.
18/2 The area spanning roughly 400 yards had been closed since early Wednesday morning.
18/3 This year’s conference attracted roughly 399 activists from countries including China, Thailand, and Nepal.
18/4 The area spanning roughly 399 yards had been closed since early Wednesday morning.

19/1 The mayor said about 700 families have voluntarily left the area and more were applying for relocation help.
19/2 The last recorded radar return showed the plane at an altitude of about 700 feet not far away from the crash site.
19/3 The mayor said about 701 families have voluntarily left the area and more were applying for relocation help.
19/4 The last recorded radar return showed the plane at an altitude of about 701 feet not far away from the crash site.

20/1 Roughly 750 people drowned when their trawler sank between Libya and southern Italy.
20/2 The redesigned Ford F-150 pickup is roughly 750 pounds lighter than the older truck.
20/3 Roughly 749 people drowned when their trawler sank between Libya and southern Italy.
20/4 The redesigned Ford F-150 pickup is roughly 749 pounds lighter than the older truck.

21/1 The prison complex holds approximately 800 inmates in separate male and female lockups.
21/2 The Ryan Aeronautical struck a tree and crashed on a golf course approximately 800 feet from the runway.
21/3 The prison complex holds approximately 801 inmates in separate male and female lockups.
21/4 The Ryan Aeronautical struck a tree and crashed on a golf course approximately 801 feet from the runway.

22/1 The United States alone has flown roughly 900 combat missions over Iraq since August.
22/2 The National Hurricane Center said that Storm Ida was centered roughly 900 miles west of the Cape Verde Islands.
22/3 The United States alone has flown roughly 899 combat missions over Iraq since August.
22/4 The National Hurricane Center said that Storm Ida was centered roughly 899 miles west of the Cape Verde Islands.

23/1 An extended Bundeswehr mission in the Mediterranean could in future involve around 950 soldiers.
23/2 Hpakant, around 950 kilometers northeast of Myanmar’s biggest city, Yangon, is the industry’s epicenter.
23/3 An extended Bundeswehr mission in the Mediterranean could in future involve around 951 soldiers.
23/4 Hpakant, around 951 kilometers northeast of Myanmar’s biggest city, Yangon, is the industry’s epicenter.

24/1 The warning was for around 1,000 customers in the Hamlin area.
24/2 A handful of prior oil spills have taken place on the line, the largest around 1,000 gallons.
24/3 The warning was for around 999 customers in the Hamlin area.
24/4 A handful of prior oil spills have taken place on the line, the largest around 999 gallons.


1 Introduction


Information overload has become one of the most critical challenges in human
history. It has been shown that speech, writing, math, science, computing and the
Internet are based on independent languages, which together form an evolutionary
chain of languages that arose in response to information overload (Logan 2006).
Recent technological developments such as the Internet of Things (Färber et al. 2020)
or cognitive augmented reality (Chi 2009) make clear that this chain continues to
advance. Researchers must therefore ask themselves which approaches are suitable
for coping with the new levels of complexity.

Cognitive representation models are a key element of this evolution, not only for
the individual but also as artefacts that can be used in conversation. A cognitive
representation model can be understood as an abstract model from which an individual
can infer the relationship of objects to one another in his or her environment
(Kaplan et al. 2017). Typically, the objects are related based on their properties.
Considering, e.g., a scale from “tiny” to “big”, a “needle” would relate very strongly
to “tiny”, whereas a “mountain” relates more to the “big” property. Using cognitive
representation models in a collaborative manner can improve situations in which
someone tackles a problem-solving task, such as human-robot interaction (Spranger
2016), within a complex environment. In such situations, people communicate
successfully if they come to the conclusion that they are talking about the same things
and their cognitive representations converge (Brennan 2005). Such a convergence is
known as semantic co-creation (Gergen 2009). Evaluating this phenomenon is difficult
under realistic conditions because many factors (like technical communication
problems) can bias the observation (Kraut et al. 2002). Simple collaborative
identification tasks (also called referring expression tasks) enable the evaluation
of shared cognitive representation models in controlled laboratory settings and
allow the progress of semantic co-creation to be observed moment by moment (Brennan
2005). Findings on how to reach a state of semantic co-creation more easily are
helpful in developing adaptive systems that make the complexities of the environment
easier to handle.

Previous work on referring expression tasks has traditionally evaluated
collaborations that have a shared space (Kraut et al. 2002; Brennan et al. 2008;
Neider et al. 2010; Müller et al. 2013; Hanrieder 2017) as well as a shared cognitive
representation model (Brennan 2005; Keilmann et al. 2017). A basic example is
two people who try to find a particular street in a city together by sharing a
geographical map. One person who is familiar with the location of the street could
explain the route to this target by referring to places, in relation to the target
street, with which both participants are familiar.

In one specific case, researchers wondered about the benefit of using markers
within these shared artefacts to improve coordination behaviour based on visual
evidence. The question arises whether a shared marker can help the participants
achieve a state of semantic co-creation based on a shared cognitive representation
model. A shared marker can be anything (like shared gaze (Brennan et al.
2008), a shared mouse (Müller et al. 2013) or a shared location (Keilmann et al. 2017))
that can be used in a shared space or cognitive representation model as a spatial
indicator (Müller et al. 2013). Results in this field indicate that shared markers are
in general a beneficial tool (Brennan 2005; Brennan et al. 2008; Hanna and Brennan
2007; Neider et al. 2010; Müller et al. 2013). This becomes obvious when we
reconsider the map example. In the simplest case, a marker could be a participant’s
finger moving across the map in order to explain or support a description
non-verbally. Suppose participant A says: “Drive straight through the small street
until you reach the next crossing!” Participant B then moves his finger along
the road in the way he has understood participant A’s utterance. Once participant B
has moved far enough, participant A continues his description,
e.g. “Ok! From this crossing, then turn left again.” In case of a misconception,
participant A would say to participant B, e.g., “No! I meant another street.” The
finger, as a kind of marker, applies the participant’s conception to the map, which
indicates to the other participant which aspects were comprehended correctly. Using
such a marker (pointing to portions of a map) in addition to a cognitive representation
model (a map) appears to be successful in solving a collaborative language task. All
participants are informed promptly, via the model, about what has currently been
understood (Kraut et al. 2002).

Despite the obvious benefit of using a shared marker, current research results
are not as clear as might be expected. Specifically, the problem is that every study
enforces the use of a shared marker while the task durations are very short. For
example, Brennan observed task durations between 10 and 20 s (Brennan 2005).
Such short durations suggest that no real team interaction occurred, in which case
using a marker is not a benefit but merely a requirement for finishing the task. In
this contribution, we want to assess the benefit of a shared marker when its use is
optional, in relation to increased decision complexity. Decision complexity is a
user-constructed criterion based on the number of alternatives available (Payne 1976).
It has been shown that configuring structural properties of a shared space (Kraut et al.
2002) or cognitive representation model (Keilmann et al. 2017) can influence
perception and even communicative success. If a shared marker is used optionally,
it can be understood as a linguistic constraint tool. Such tools
relate to linguistic constraints, which express, by means of symbols, the application
of dynamic constraints in language use. We hypothesize that the value curve of a
linguistic constraint tool, given a certain cognitive complexity, determines whether
or not it is useful in a given situation of collaboration. In this study we will show
that if a shared marker becomes optional under low decision complexity, it becomes
too expensive to use.

This contribution is structured as follows. Collaborative task settings using
conversation can be explained with the contribution model (Sect. 2).
According to the contribution model, semantic co-creation happens when the grounding
criterion is reached. There are forces that influence the nature of contributions within
the discourse, called linguistic influences. Linguistic influences on the grounding
criterion have yet to be investigated in research. Hence, we explain the concept of
linguistic constraints in detail. Here, we describe how linguistic constraint tools,
represented for example by a cultural artefact, can influence collaborative
task performance (Sect. 3). We explain a new setting, in which a marker is applied as a
linguistic constraint tool based on a given cognitive representation model (Sect. 4).
Based on our theoretical considerations, we specify a research design based on a
marker condition and a complexity condition (Sect. 5). This design is embedded into a
geographic map as the most intuitive cognitive representation model. Furthermore, we
describe our collaborative task of identifying a target location, which serves to
evaluate the role of a shared marker in addition to a cognitive representation model.
The described setting is a very common task, which enables participants to take part
without any prior briefing. The principle of least collaborative effort is made
continuously visible to the team by implementing a delay-discounting decision problem
in the reward system. The setup is completed by the description of the testing
conditions in terms of the applied communication media and representation
model constraints. Based on this specification, we describe the applied procedure
in detail (Sect. 6). Central to our procedure is a chat tool integrated
into a shared geographic map. While three participants are meant to solve the task
at three workstations without any moderation, the tool provides the testing
conditions step by step and monitors the progress of a game round. The results show
that we cannot observe the characteristics of team-focused interaction (Sect. 7). With
the first level of decision complexity, no real team interaction occurred. With
the second level of decision complexity, more intense team interaction occurred, but
the marker condition is in general at a disadvantage. Theoretically, it is assumed that
if participants collaborate, they will be most successful if the discussion is
constrained in some fashion (team-focused interaction hypothesis) (Sect. 8). While our
research design tries to confirm the team-focused interaction hypothesis, the results
contradict its assumptions. From our point of view, decision complexity seems to be an
important control parameter that has not been covered by the team-focused
interaction hypothesis.


2 Contribution Model of Conversation 


Tomasello (2014) states that one basic advantage of humanity is the capability and
motivation to collaborate and to help each other. Some human activities are only
possible when multiple people are able to coordinate in a highly complex way (for
example, “playing a duet on a piano”). The contribution model of conversation
offers a basic approach to explaining how long participants remain interested in
collaboration. The model explains the coordinative behaviour of participants through
internal economic forces. This is based on the assumption that people who participate
in a conversation act in a collaborative manner. Here, we summarise the basis of this
model, which is also explained in other contributions by Clark and Bangerter (2007),
Clark and Brennan (1991) and Clark and Schaefer (1989) (see also Fig. 1).

Fig. 1 Previous contributions: Team focused interaction for shared cognitive representation models having a marker

Coordination, common ground, semantic co-creation, contribution: In collaboration
through conversation people face the problem of coordination, which is
handled by making contributions in participatory acts (Clark and Schaefer 1989).
In participatory acts, people act together, which requires them to synchronize in terms
of content and timing. Take, for example, two musicians playing a duet on a piano.
Both musicians have to agree on which duet they would like to play (coordination of
content), and while they are playing they have to synchronize their entrances and exits
(temporal coordination). To coordinate efficiently in conversation, people
need to build a form of common ground. Common ground can be understood
as an invisible form of cognitive representation, which all participants accept. In
communication, common ground cannot be properly updated without a process. The
question is how to evolve an individual idea or conception of something into a form of
community-wide accepted semantic co-creation that is manifested in a state
of common ground (Gergen 2009). Put simply, semantic co-creation is given
if all task-related participants have a sufficient idea of how to solve a problem
successfully in a collaborative manner (Raczaszek-Leonardi and Kelso 2008).
Participants achieve successful coordination when they reach two degrees of semantic
co-creation: grounding and identification (Clark and Wilkes-Gibbs 1986). In
identification, participant one tries to get another participant to pick out an entity
by using a particular description. Identification happens as soon as the pick-out
behaviour of the second participant is visible to participant one. In contrast,
grounding happens when both participants think that they have identified the correct
entity. This means the entity has already been added to the participants’ common
ground. In a nutshell, the description required to reach common ground (content
specification) and semantic co-creation form a unit of discourse, a so-called
contribution (Clark and Schaefer 1989).

Reaching the grounding criterion using the least collaborative effort: In order
to evaluate their conversation, the participants have to set a grounding criterion. For
example, a criterion that participant A was successful in describing an object could be
that participant B takes the correct object. The grounding criterion is achieved if the
participants have invested enough effort to reach a sufficient degree of confidence in
the success of a communicative act with a specific purpose. In the context of a given
communication purpose, the grounding criterion is achieved when all participants believe
that they have sufficiently understood (Clark and Schaefer 1989). The participants
try to reach the grounding criterion with the least collaborative effort. They are
motivated to minimize their amount of work by providing dialog contributions that are as
efficient as possible. The concept of least effort was classically described by Grice’s
maxims of quantity and manner (Grice et al. 1975). Grice’s maxim of quantity states:
“Make your contribution as informative as is required; do not make your contribution
more informative than is required”. The maxim of manner includes: “Be brief (avoid
unnecessary prolixity).” If a contribution follows these maxims it is considered proper,
which means that the participants believe the contribution will be readily and fully
understood by their addressees (Clark and Brennan 1991). Nevertheless, the principle of
least effort makes no allowance for time pressure, errors or grounding (Clark and
Wilkes-Gibbs 1986). For example, under time pressure participants may not be able to plan
well-formulated brief statements, and in such cases the model of least effort fails. To
overcome these problems the principle of least collaborative effort was formulated by
Clark and Wilkes-Gibbs (1986) as follows: “In conversation, the participants try to
minimize their collaborative effort—the work that both do from the initiation of each
contribution to its mutual acceptance.” In participatory acts the participants have to
reach the grounding criterion while minimizing their effort; this is characteristic of
conversation in general.

Contributions as a historical process: Ongoing contributions of a discourse have
to be considered in a historical fashion (Clark et al. 2007). In a classical referential
communication task by Krauss and Weinheimer (1964), a
participant has to describe to a partner which of four presented abstract figures
needs to be selected. To identify the correct figure, each team requires a number
of descriptions and related feedback. The results of this experiment confirm that
ongoing interaction leads to coordination as a historical process, during which
the common ground is constantly emerging. Therefore, as interaction continues,
descriptions become shorter (Krauss and Weinheimer 1964; e.g. (1) “the
upside-down martini glass in a wire stand”, (2) “the inverted martini glass”, (3) “the
martini glass” and (4) “the martini”) and the number of required turns decreases over
time (Clark and Wilkes-Gibbs 1986). For descriptions to become gradually
more efficient, some form of working interaction between the participants is required:
the average length of descriptions only drops if participants can give
direct feedback.
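
The shortening of descriptions can be made concrete with a small sketch. It counts the words of the four example descriptions quoted above and checks that no later description is longer than an earlier one. The word-count measure and the names `word_count` and `is_shortening` are illustrative assumptions, not the measure used by Krauss and Weinheimer (1964).

```python
# Successive descriptions of the same figure, as quoted in the text above.
descriptions = [
    "the upside-down martini glass in a wire stand",
    "the inverted martini glass",
    "the martini glass",
    "the martini",
]


def word_count(description: str) -> int:
    """Count whitespace-separated words (assumed measure; hyphenated words count once)."""
    return len(description.split())


# Length of each description in the order in which it was produced.
lengths = [word_count(d) for d in descriptions]

# The history is "shortening" if no description is longer than its predecessor.
is_shortening = all(a >= b for a, b in zip(lengths, lengths[1:]))

print(lengths, is_shortening)
```

Under this simple measure the description length drops with each repetition, which is the pattern the historical view of contributions predicts when direct feedback is possible.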


3 The Influence of Linguistic Constraint Tools on Reaching 
the Grounding Criterion 


Bias on reaching the grounding criterion: The previous section demonstrated that
reaching the grounding criterion is fundamental for communicative success. For this
reason, it is important to understand how reaching the grounding criterion can be
influenced. Answering this question means looking for “tools” that are used to
let semantic co-creation happen. Modifying the performance of these tools influences
the success in reaching the grounding criterion. Three types of tools are required for
reaching the grounding criterion: signs, practices and a communication
channel. We follow the notion of Löbler (2010), according to which a sign is “everything
which is perceivable, everything we become aware through the senses.” and a
practice “coordinate ways of doing and sayings.” He further noted: “Practices
are implicitly behind all forms of explicit coordination, they coordinate implicitly,
and we can become aware of them by the ways we do or say things.” If signs
and practices are to be applied, then a communication channel has to be used to
overcome a spatial distance in the shared environment. Here we follow the basic
channel notion of Shannon’s sender-receiver model (Shannon 1948): “The channel
is merely the medium used to transmit the signal from transmitter to receiver.”

It has already been pointed out that practices, as well as the resources available
within a communication medium, are critical factors (Clark and Brennan
1991). The grounding criterion can readily be translated as what needs to be
understood for a given purpose (Grice et al. 1975). This criterion changes through
the application of a specific content practice suitable for a given purpose (Clark
and Wilkes-Gibbs 1986). For example, if participants have to identify objects, then the
conversation focuses on them and their identities. The applied content practice has
to ensure that the objects can be identified securely and quickly. With indicative
gestures as an exemplary practice, an object is identified if a speaker refers to an
object nearby and the addressee can identify it by pointing, looking or touching.
A communication medium (like e-mail or fax) also has an effect on reaching the
grounding criterion, because media differ in how they fulfil communication channel
constraints (Clark and Wilkes-Gibbs 1986). There is a set of costs (e.g. formulation costs
or understanding costs) that can quantify these constraints from different perspectives.
Nevertheless, the influence of signs has not yet been considered.

Constraints: In this study we follow the idea of linguistic constraints by Pattee
(1997) as reformulated by Raczaszek-Leonardi and Kelso (2008). A fundamental
premise of Pattee’s theory of living organisms states that there is an interrelation
between measurement and control. Here, control is about producing a desirable 
behaviour in a physical system by imposing additional forces or constraints. These 
constraints are not fixed, but are applied and adapted based on the demands of the 
environment. They are applied dynamically, following the purpose of a coordinated 
action. 

Linguistic constraints: The described notion of constraints is limited to a specific
moment and place. In addition to control, measurement is a symbolic result of the
dynamic process. While constraints in a moment of control are limited to a certain
point in time and space, the linguistic constraints emerging in the moment of
measurement are not fixed. Linguistic constraints, instantiated through symbols,
encode stable patterns of dynamic variables that are relevant to control something
between an individual and some environment. The human’s task of measurement is to
choose a relevant pattern and ascribe a symbol to it. Taken together, linguistic
constraints applied in measurement and control can only be understood in the situation
and in the context of space and time in which they are applied. They cover the history
of constraint application in language use across multiple timescales.

Linguistic constraint tools: When bringing linguistic constraints into practice, we have
to note that they are typically embedded. Löbler (2010) pointed out that “signs
render services in helping to find what we are looking for.” For us
it follows that linguistic constraint tools instantiate services in relation to linguistic
constraints. These tools (e.g. discussion, cultural artefacts, cognitive representation
models or markers) can be valuable for achieving a state of semantic co-creation more
easily.


4 Using a Marker in Shared Cognitive Representation
Models as a Linguistic Constraint Tool


In the last section we introduced the concept of linguistic constraint tool. The question 
remains open how linguistic constraint tools can help in achieving semantic co- 
creation based on shared cognitive representation models. In this section we present
the notion of team focused interaction and summarize previous findings on using a
marker in cognitive representation models as an example of a linguistic constraint 
tool. 

Origins of team focused interaction: Team focused interaction describes an
approach to identifying the correct target in situations of high decision complexity
(e.g. identifying a target from many). The hypothesis was introduced by Zubek et al.
(2016). They evaluated the constraining role of cultural artefacts on the performance
of a collaborative language task within a real-world setting. The authors used a wine
identification task. Participants were divided into pairs and individuals. In
contrast to individuals, pairs could talk freely to each other. Specifically, they tried
to identify wines based on their shared tasting experience. Every pair had to talk about
the smell experience of wine, so it can be assumed that, given the same purpose,
similar practices were applied. Externally, the conditions of communication
for all pairs were the same: they could talk freely to each other for as long as they
wanted in a place they shared physically.

The cultural artefact was a wine tasting card that contained 21 items, including
categories and their available attributes in the fields of taste, smell and general
characteristics of wine. For example, there was a category “Alcohol” and an attribute
“Light”. In team interaction, the participants could use this taxonomy to describe their
tasting experience. The experiment was designed to evaluate identification performance
based on two conditions: whether a wine tasting card was used, and whether one
participant worked alone or two participants interacted freely.

The results showed that interacting pairs were better at identifying the correct
wine than an individual wine taster. With the help of a wine tasting card the accuracy
of individual participants did not improve significantly. The best performance was
achieved when a pair of wine tasters used a wine tasting card. Pairs using a card had
a more consistent vocabulary than those without. The more consistent their vocabulary,
the more successful they were in identifying the correct wine. In addition, participant
pairs using a wine tasting card had a lower variance in their identification of wines
compared to participant pairs without such cards. The lower variance relates to the
usefulness of wine categories within the wine tasting card. These linguistic categories
likely function as linguistic constraints by focusing the communication, making wine
identification more reliable and precise. Taken together, a linguistic constraint tool
can stimulate team interaction towards more focused communication; we call this idea
the fundamental premise of team focused interaction.

Nevertheless, this study did not use a common cognitive representation model,
and no shared marker was present either. In both cases, the cognitive
representation model was created by each participant separately in their own mind.

There are a number of referring expression tasks that cover our research interest in the
combination of having a shared marker, using a cognitive representation model
and taking decision complexity into account (see also Fig. 1).

Kraut et al. (2002): This study by Kraut and colleagues investigates the role of
decision complexity in communicative success. Decision complexity was controlled
by the puzzle difficulty (easy: non-overlapping elements, complex: overlapping elements)
and the color drift (easy: static colors, complex: changing colors). These two measures
are contrasted with having a shared space that is either delayed (3 s delay) or not
delayed. For evaluation purposes, a team of participants has to arrange a puzzle in the
correct order. One participant explains to the other how to reorder the elements.
The results show that teams solve the task faster when the puzzle uses
non-overlapping elements and static colors. Changing colors in particular
become a problem for the participants if the screens are not shared immediately. Timing
the utterances in discussion moment by moment becomes even more complex,
because utterances meant to achieve semantic co-creation are biased. In this respect,
it becomes obvious that achieving semantic co-creation is influenced by the
perceived decision complexity.

Brennan (2005): Brennan wants to observe the convergence of semantic co- 
creation moment-by-moment. She is especially interested in how the fact of having a 
shared marker as a visual indication can influence this process. The research design 
compares a factor of visual evidence (having a shared marker or not) in contrast 
with map familiarity (being familiar with a map or not). Based on a geographical 
map, a participant has to get to an unknown target location, which is visible and 
explained by a second participant. The participant’s mouse movement gives evi- 
dence of what has been understood based on the speech (description) of the first 
participant. Brennan defines several time stages of comprehension towards semantic 
co-creation: start, first move, close to the target, reliably understood but not at the 
target, pause and identified. The results show that participants who have no shared 
marker not only require more time to finish a task, they also require a lot more words 
to get from a reliable to a final state, and they require a lot of time at the pause stage 
to get a collaborative acknowledgement. To sum up, a shared marker acts as a visual 
linguistic constraint tool that simplifies the process of semantic co-creation. Partici- 
pants become more efficient when a shared marker is present. A second observation 
shows that solving decision conflicts becomes much more difficult without a shared 
marker. The final acceptance phase requires much more time than with the presence 
of a shared marker. Regarding this, decision conflicts can be present as a natural part 
of semantic co-creation. 

Hanna and Brennan (2007): The authors ask if coordination by shared gaze can 
outperform speech. In the previous study, Brennan has designed shared gaze as 
something that gives visual evidence. In this investigation shared gaze is introduced 
as a new type of linguistic constraint tool in addition to discussion based on speech. 
The role of shared gaze in discussion is evaluated through structural task properties 
(the orientation of available elements and the distance of a competitor in relation 
to the target) that control the perceived decision complexity. The task requires
identifying the correct target element from a set of forms that are presented in front
of the participant. A second participant knows the correct element and describes this
target from the other side of the table. Both participants are recorded by eye-tracking
and voice recording. For evaluation purposes both recordings are integrated into one
stream. The results show that eye gaze produced by a speaker can be used by an
addressee to resolve a temporary ambiguity, and it can be used early. Shared gaze
outperforms speech because it orients the addressee even faster. This observation lets
us conclude that there is a competition between shared gaze and discussion, which
is won by shared gaze.

Nevertheless, the results hold true only up to a certain level of decision complexity. 
Shared gaze outperforms speech only in cases with no similar objects (no-competitor 
condition) or far distant similar objects (far competitor condition) in the possible 
answer set. In the case when there are similar objects close to each other (near 
competitor condition), no advantage of shared gaze in contrast with speech can 
be observed. That means a complementarity of multiple linguistic constraint tools 
as already described for multiple timescales is needed in cases of high decision 
complexity. In this respect the complementarity happens in time (multiple timescales) 
and space (multiple tools) as well. 

Brennan et al. (2008): In a further study Brennan and colleagues are interested in
collaborative search scenarios in which neither of the two participants knows the target.
Collaborative search was evaluated under four conditions: shared gaze, shared speech,
shared gaze plus shared speech, and no sharing. In their study participants had
to identify a possibly present O within a set of Qs (O-in-Qs search task). When
participants searched together without sharing anything, accuracy was very low; in
fact, it was lower than when individuals searched alone. The best results in terms of
search duration and accuracy were achieved under the shared-gaze condition.
A longer search was observed in teams under the shared-speech and the shared-gaze plus
shared-speech conditions. The observation that shared gaze alone outperforms shared gaze
combined with shared speech can be explained: the given task is clear to participants
and can be finished successfully without any further negotiation. If no semantics needs
to emerge from collaboration, then no semantic co-creation is required. Even so, team
focused interaction is not only about collaborating users having a linguistic constraint
tool; it also involves successful coordination based on semantic co-creation. That means
having a shared marker (e.g. shared gaze) can outperform a combination of shared
marker and discussion, but only in cases where semantics does not have to emerge.

Neider et al. (2010): Building on the previous study of Brennan et al. (2008), Neider
et al. are interested in collaborative search scenarios with two participants working
as novices. In contrast to the earlier study, a scenario is studied in which consensus
between the participants is required. Hence, the study requires that both participants
together identify the correct target in order to finish the task in a time-critical
manner. The study design compares shared gaze only, speech only, and shared gaze plus
speech. In addition, a no-communication condition is applied. In a sniper task, a
virtual environment is used in which the participants identify the correct sniper target
together. The results show that shared gaze together with discussion outperforms shared
gaze alone. This observation confirms the previous assumption that semantic co-creation
has to be required for the benefit of team focused interaction to unfold.

Further, Neider et al. found that in these cases the principle of least collaborative
effort holds. With shared gaze, in contrast to the speech-only condition,
the first participant was faster in identifying the target location because its location
did not need to be described in detail. If the situation was clear to the second
participant, the first participant only had to note that the second had to go to the
target, and the task was solved successfully. In a less clear situation, monitoring gaze
behaviour as well as more scenic descriptions slowed down the consensus phase. Thus, two
scenarios with different costs on the consensus side were observed. Only when necessary
did the participants dive into a more detailed discussion. Based on the principle of
least collaborative effort, such behaviour can be expected.

Müller et al. (2013): The authors investigate the role of discussion context and the
question of whether a shared mouse can approximate shared-gaze behaviour. They applied a
puzzle arrangement task and distinguished between sharing a common gaze, a common
gaze and speech, sharing a mouse and speech, or speech only. A participant had to
arrange a puzzle in the correct order from a set of randomly organized puzzle elements,
based on the description of a second participant. In addition to the different
sharing conditions, one group of participants had to strictly follow particular
instructions (low autonomy), while another group could rearrange the puzzle freely (high
autonomy). The level of autonomy showed the strongest effect for all communication
conditions. Low autonomy conditions resulted in better task performance, based on
lower error rates, independent of the communication conditions. This shows that
the given task context, more specifically having high or low autonomy, acts as a
constraint in communication. The results of Müller and colleagues underline the
observation that interacting pairs perform best when they are restricted by some form
of linguistic constraint tool. In contrast to Zubek et al. (2016), Müller’s study did not
require a predefined taxonomy, like a wine tasting card, in order to enforce a specific
behaviour of the participants. Furthermore, autonomy, understood as a
discussion rule tool, is not a shared marker. Benefits and costs are quite different.
With such a discussion rule, continuous monitoring as described by Neider et al. (2010),
for example, is not needed. The results also show that in cases of low autonomy, the
shared gaze and discussion condition performs even better than the discussion condition
alone. That means several linguistic constraint tools can be used at the same time.
Discussion, shared marker, shared taxonomy or even conversation rules are only some
examples of such tools.

By comparing shared mouse and shared gaze, it was additionally observed that
a shared mouse is a good approximation of shared gaze for the given task. Solution
times were within the same range, and error rates were only higher for gaze than
for mouse transfer when the former was used without a speech channel. As a
visual indication, a shared mouse requires much more time than shared gaze, but in
contrast shared gaze provides much more marker data, which has to be interpreted
sufficiently by the participant. In summary, linguistic constraint tools are more
or less suitable depending on the current purpose of coordination.

Keilmann et al. (2017)¹: In our terms, Keilmann and colleagues try to evaluate
the team focused interaction hypothesis, where a cognitive representation model
becomes a linguistic constraint tool. They compared a partly visible and a completely
visible labyrinth, explored either by an individual or by a pair of participants.
The study thus examined whether a labyrinth, as an example of a cognitive representation
model, was shared or not, and whether participants worked in a team or alone. If the
labyrinth was shared, both participants could see the complete map. In contrast, if it
was not shared, only a specific subarea of the labyrinth was visible to each participant
individually. The current position of a participant, acting as a shared marker, was only
shared if the participant was located in the visual field of the other. As a third
factor, the perceived decision complexity was controlled by the number of intersections
present in the labyrinth. Communication between collaborating participants was possible
via headphones. The participants had to search the complete labyrinth as fast as
possible to collect all pickup items. The results show that collaborating participants,
in contrast to an individual searcher, are faster and require shorter trajectories, even
though they have higher error rates in picking up correct items. Hence, team interaction
is more expensive than searching alone, but together pairs can achieve a better
identification performance. These observations refine the idea of Zubek et al. (2016)
that team interaction improves identification accuracy but incurs higher communication
costs to achieve coordinated behaviour. If the cognitive representation
model was shared, teams generally outperformed single participants. We consider the
labyrinth as an example of a cognitive representation model, which is another linguistic
constraint tool. Using a shared cognitive representation model collaboratively
enables interactions to be more focused.

Hanrieder (2017): Hanrieder transforms Keilmann’s stimuli from a top-down
into a within-environment view. The participants, in the role of firefighters, search a
floor for casualties as fast as possible. The task was applied with two cooperation
modes (individuals or pairs) and several levels of labyrinth complexity
(8, 11, 14, 17, or 20 intersections per environment). Hanrieder’s setup comes very
close to Keilmann’s, but in contrast it provides no shared cognitive representation
model and it prohibits communication between participants who work in teams.
Besides the mode of cooperation (individual vs. team collaboration), the decision
complexity is controlled by the number of intersections. It can be shown that
the number of intersections has a negative impact on task performance. With fewer
intersections, the participants, be they individuals or pairs, needed less time to
finish, travelled shorter distances and had lower error rates (missed fewer pickup
items). In more detail than Kraut et al. (2002), it was possible to observe that
increasing decision complexity leads to higher costs and error rates. Keilmann’s
observation that groups, in contrast to individuals, are more expensive while their
error rates are lower can also be confirmed in a virtual environment.

¹ Note: The contribution by Keilmann et al. does not seem to be available for public use. Hence, our
description is based on explanations by Hanrieder (2017).

Additionally, the degree of division of labour is measured by the self-overlap of the
participants (the number of locations in the labyrinth that have been visited more than
once), compared between individuals and pairs. The results show that both individuals
and pairs achieve less overlap in more complex environments. Comparing pairs to
individuals, pairs achieve much less self-overlap than individuals. With regard to
team focused interaction, this insight is quite interesting because division of labour
can be improved even if the participants cannot communicate and have no shared
tool working as a linguistic constraint. This observation does not contradict the
fundamental premise of team focused interaction. Division of labour seems to be a
fundamental practice which is enhanced by team focused interaction.
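
The self-overlap measure described above can be sketched in a few lines. The sketch assumes that a trajectory is recorded as a sequence of discrete grid locations and that self-overlap is computed per participant. The function name `self_overlap` and the sample trajectory are hypothetical and are not taken from Hanrieder (2017).

```python
from collections import Counter


def self_overlap(trajectory):
    """Return the number of distinct locations visited more than once.

    `trajectory` is assumed to be a sequence of hashable locations,
    e.g. (row, column) grid cells, in the order they were visited.
    """
    visits = Counter(trajectory)
    return sum(1 for count in visits.values() if count > 1)


# Hypothetical trajectory: the participant walks down a corridor,
# turns around and revisits two cells before taking a new branch.
sample_trajectory = [(0, 0), (0, 1), (0, 2), (0, 1), (0, 0), (1, 0)]

print(self_overlap(sample_trajectory))
```

A lower value means that a participant revisits fewer locations, which is the sense in which less self-overlap is read as a better division of labour.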

Prediction: In the following, we want to predict the effect of a shared marker if 
it is used on top of a cognitive representation model. Such a marker is a very flexi- 
ble user-driven linguistic constraint tool, which is embedded into a shared cognitive 
representation model. A marker used in addition to a shared cognitive representation 
model limits the decision space. By using a marker each participant can time their 
words and actions better. At any moment of content specification and semantic co-creation,
the participants exchange evidence via the shared cognitive representation model as to
whether the grounding criterion is fulfilled or not. In addition, the marker position
informs all participants how far they are from reaching the grounding criterion.
If the marker is not moved, then only the given cognitive representation model can
be used to limit the decision space.

Using a shared marker is a very common setting in the presented referring expression
studies. If such a shared marker is present, then its use is enforced in each case:
the participants have to move the shared marker to fulfil the task successfully. Such
an “enforced move requirement” is a problem because it prevents a fair competition
between the shared marker and discussion as two linguistic constraint tools.
We are interested in the question of whether the fundamental premise of team focused
interaction holds even if shared marker usage becomes optional. If we compare the
experiment durations of previous studies, then we can observe that study durations
can be grouped into two categories. The first category comprises experiments with
a total duration of up to 20 s, while the second category comprises durations of up to
140 s. If we think about the nature of discussion, then coordination in the
first category of tasks appears to be very straightforward. Neider et al. (2017) named this
phenomenon one feedback based on description. From our point of view, this means that there is
a communication channel but no real team interaction occurs. Such behaviour can
be explained by a very low perceived decision complexity. Hence, we predict
that if decision complexity becomes very low, then no real team interaction occurs.
With the second, higher level of perceived decision complexity, the fundamental premise
of team focused interaction should hold. Having a shared marker should provide
an advantage in achieving a good identification performance (Table 1).

Table 1 Comparing previous studies based on the observed task duration

                              Total durations (approx. mean values)
Contribution                  MIN       MAX
Kraut et al. (2002)           40 s      120 s
Brennan (2005)                10 s      20 s
Hanna and Brennan (2007)      n.a.      n.a.
Brennan et al. (2008)         30 s      120 s
Neider et al. (2010)          8 s       12 s
Müller et al. (2013)          50 s      140 s
Keilmann et al. (2017)        n.a.      n.a.
Hanrieder (2017)              50 s      120 s


5 Setup 


The team focused interaction hypothesis claims that using a linguistic constraint 
tool influences the success of reaching the grounding criterion. In our study we are 
specifically interested in using a cognitive representation model as a shared artefact, 
with an additional but optional shared marker available. In a collaborative manner, the
team can solve the given task based solely on linguistic features through discussion.
It is up to the team to use a shared marker as visual support. The limiting power of 
a marker will be evaluated by comparing groups with and without a shared marker. 
The complexity of the cognitive representation model appears to strongly impact the 
performance of such markers. Hence, we implemented marker conditions with two 
degrees of complexity. 

Which cognitive representation model is suitable for the given evaluation purposes?
In our study, a cognitive representation model is present as a shared artefact.
There are several forms of cognitive representation models (such as the conceptual
space (Gärdenfors 2004), the biplot (Gower et al. 2011) or the associative semantic
network (Collins and Loftus 1975)), which differ in their representation (e.g.
spatial vs. graph representation), in their dimensionality and in the number of entities
they consist of. However, we only applied the geographic map (Monmonier
2018) as a very intuitive example of a cognitive representation model. For our study
we required participants to understand the model without any prior learning effort;
we therefore selected the geographical map as the most widely accepted model.
Geographical maps represent complex structures based on standardized criteria (typically
distances). A map can be specified as a 2-dimensional length space (l × l, e.g. a map of
Germany) or a 3-dimensional length space (l × l × l, e.g. an orbit map of our galaxy). By using a
globally standardized metric to describe the orientation between very many entities
within a space (e.g. all cities in a country), it becomes possible for a large society of
people to coordinate within this shared space. For example, millions of deliveries
are shipped around the whole world every day based only on one standardized geographic
world map. These characteristics make a geographical map very useful where
a large group of people want to coordinate in a shared space (Monmonier 2018). We
verify our model preference by asking the users about their familiarity with some
other promising cognitive representation models. We measure tool familiarity as the
degree of model usage in daily life.

What form should the referring expression task take? We want to set up a remote
referring expression task in a shared environment, since such a task allows us to observe
whether, based on a given referring expression, an intended referent can be picked out (Clark
et al. 2007). Such shared environment tasks are possible in two settings. First, in the
expert-novice setting (e.g. Müller et al. 2013; Brennan 2005), one participant is
familiar with the target (expert), while another participant, who is not familiar with
it (novice), has to identify it. Second, the novice-novice setting (e.g. Brennan et al.
2008; Zubek et al. 2016) is about identifying a target that is unknown
to both participants (both act as novices). To evaluate the impact of shared
markers on reaching semantic co-creation, it is required that shared markers can
play a primary role for coordination purposes between the participants. Hence, we
prefer to set up an expert-novice setting, because in such a setting participants have to
collaborate to identify the target. In novice-novice settings, by contrast, it is possible
to search separately (Müller et al. 2013).

The aim of our evaluation is to observe when and how semantic co-creation occurs 
within a group of participants. We decided to set up a group of three participants, a
describer, an actor and an observer. Under shared marker conditions the marker is 
visible to all participants, whilst under non-shared conditions the marker is not visible 
to the participant whose task it is to describe. In such conditions it is possible that 
the actor (participant carrying out actions) can help the observer (passive participant 
observing interactions between the other participants) by using the marker. 

The task itself should be implementable based on a geographic map as an example 
of a cognitive representation model. Here, the map task is one example, where one 
participant needs to explain the route of a map to a second participant (Anderson 
et al. 1991). The two participants are presented with the same map, the first partic- 
ipant is shown a route marked on the map and is asked to describe this route. The 
second participant marks down his comprehension of the route, based on the given 
description. The map task differs from other tasks as the communicative success is 
measured on a metric scale. Describing a route within a map is a complex task, which 
requires high intellectual effort of the pair involved. In our study we implement an 
easier target location task, similar to that used by Brennan (2005). In Brennan’s target 
location task a car icon has to be manoeuvred towards a target location. Only the 
participant whose task it is to describe can see the target location for the car on their 
map. The actor tries to find the unknown target location by applying the instructions 
described to him. The actor can use the shared mouse to relocate the car icon within 
the geographic map based on the hints given. The shared mouse movement is evaluated
continuously to uncover the current state of comprehension towards the grounding
criterion. In the study by Brennan (2005) the task is completed once the actor places
the car icon very close to the target. Such a setting forces the actor to use the car as
an existing marker. However, we want to make the use of the markers optional in
order to evaluate the benefit of using them. Hence, our task ends when one city is
selected from a given list of all cities present on the map. In principle, it is possible
to finish this task without moving the marker.

How are model considerations implemented within the task? The aim of our task 
is to make the characteristics of the contribution model visible to all participants at 
any given moment of interaction. As described previously, the contribution model 
is implicitly present while participants are interacting in conversation. Nevertheless, 
there are no defined conditions concerning how brief or detailed each contribution to 
conversation needs to be in order to be understood. This lack of specificity leads to a 
bias of incorrect reward assessment by the participants. Applying a time constraint to 
the collaborative task incorporates time pressure and makes participant contributions 
briefer (Neider et al. 2010). One disadvantage of such time constraints is that it is 
harder to interpret how much effort the participants invest in conversation. Hence, 
we prefer to describe the task the participants have to complete as a collaborative 
conflict situation of least collaborative effort to reach semantic co-creation. 

The “conflict” becomes visible to the participants online through scores assigned 
to participants’ actions based on the delay discounting decision problem (Scherbaum 
et al. 2016). In delay discounting decision problems, a single participant has two
options, of which they have to select the more beneficial one. The first option is
called the sooner-smaller (SS) option: the user can obtain it very quickly
but has to accept a lower reward value. In contrast, the second option is called the
later-larger (LL) option. This option returns a much greater value to the user, but
it is much more difficult to reach, resulting in a long delay. Unlike in the classic
single participant approach, multiple participants who are trying to coordinate try to 
reach the highest degree of value discounting. Based on an initial reward score value 
each team member has to ensure that their actions reduce the team score as little as 
possible, while reaching the grounding criterion should be achieved as quickly as 
possible. In our case, the describer needs to decide whether they want to apply a more 
detailed description (SS-option) or only slight hints about the location, e.g. using the 
words “hot” or “cold” (LL-option). If a participant applies his own description, he 
wants to ensure that he reaches the grounding criterion fast, even though only a 
small team discount can be achieved. In contrast, if a describer applies only hints, he 
tries to achieve a larger team discount, while it becomes more difficult for the team 
to reach the grounding criterion, because such hints are much less informative. In 
our case the actor and observer need to decide whether to select just a target subset 
(SS-option) or whether they want to know the exact location (LL-option). Actors or
observers who only select a target subset lower the requirement for reaching the grounding
criterion, because the correct answer only needs to be one element of the given subset. This
option has the disadvantage that the team score decreases considerably. In contrast,
if an actor or observer selects a unique correct answer, the requirement to reach the
grounding criterion is much higher, because exactly one correct answer needs to be
identified. This more delayed option seems attractive because it decreases the team
score much less. By applying participant action scores in the coordination task, it
will be possible to measure interactively the degree to which the SS and LL principles
have been used. For implementing collaborative delay discounting problems, we
decided to use a text chat tool instead of audio channel communication
(e.g. Neider et al. 2010).

What are the test conditions? Test conditions of referring expression tasks can 
be described using the limitations of communication media (Clark and Brennan 
1991) along with the content provided with descriptions or identification skills in 
order to determine the credibility of other participants (Edwards and Myers 2007). 
Co-occurrence assumes that the participants are present at the same time. Our task 
is applied to three participants, who have access to the same shared workspace in 
different roles. Together with a shared cognitive representation model the participants 
can communicate by using a shared chat system. Here, they can write and read 
messages at the same time (simultaneity), they can look at older messages within the 
chat protocol (reviewability) and read their messages before they submit them into 
the chat (rereading). In shared space scenarios communication delay becomes an 
additional critical issue (Kraut et al. 2002). The communication between the nodes 
based on LAN connection as well as our script performance happens without any 
perceivable delay. Furthermore, we need to ensure that the task is achieved based 
only on the geographical map and chat messages available to the participants. No 
additional communication media should be used (e.g. other messenger services), no 
other sources of information should be available (e.g. Wikipedia) and no common 
ground should exist between the participants before starting the task. To guarantee 
these test conditions, participants worked at prepared working stations, which only 
offered access to the testing environment. The use of mobile phones was prohibited 
during the evaluation. 

Within the test conditions for the cognitive representation model, the following
questions are considered: what do the participants already know about the map
(pre-existing background knowledge (Brennan 2005)), how are the cities of the map
structured (entity structure) and can the participants use a symbol for a city for
communication purposes (symbol entity referencing). Pre-existing background knowledge
occurs where participants have some common ground beyond the task, which
could help them to complete the task more easily. If our task were to use a map of
Germany and the participants were German, they could use their background knowledge
to identify particular places on a common map faster than if their task involved
a map of an area unknown to them all, e.g. Ukraine. Our study, in fact, deploys maps
of Ukraine and other countries of which we consider it unlikely that participants
have previous knowledge. The second factor, entity structure, is about the complexity
of coordination within the cognitive representation model. This complexity
is indicated by the number of elements and the proportion of reference points from 
all elements (non-reference points). A reference point is a location with discrim- 
inable features, which allow a subject to have a geographical orientation (Sadalla 
et al. 1980). Having reference points improves orientation in cognitive representa- 
tion models (Hanrieder 2017). Within most maps there are a set of reference points, 
in our case popular cities in a country (e.g. Kiev in Ukraine). If a reference point was 
a target the participants could refer directly to cities based only on their name (e.g. a 
city which is marked “Kiev”). To identify the correct target the describer could send 
this description to the actor, who can identify this place easily, without any further
interaction. To avoid such behaviour, we add a set of randomly chosen less well-known
cities, one of which needs to be identified. We introduce these random cities
with a symbol instead of a name. This approach prevents town names from being
used as descriptions. However, the describer could refer to a town’s symbol instead
(“go to the town x₁”). To solve this potential problem, we inform all participants
that their symbols for the given towns are all different, making such references no
longer useful. This should prevent the participants from using the symbols of towns
for coordination purposes. It also simplifies the map, because only reference points
can be used between the describer and the actor. An increasing number of reference
points makes orientation on the map landscape easier. Likewise, identifying a
target location becomes easier if the number of potential decision points is smaller.
Low map complexity means there is a large number of reference points and a small
number of non-reference points. We defined complexity level 1 (low) as consisting
of 5 reference points and 10 non-reference points. Map complexity level 2 (high)
comprises 1 reference point and 25 non-reference points.


6 Methods 


As each task is limited to a duration of five minutes, each team completes the whole 
experiment (six trials) within 30 minutes. Both the actor and the observer should 
identify the target location, which means there are two task results per task. In total,
twelve task results are recorded for each team, resulting in a total of 156 instances
for evaluation.
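
These counts follow from simple multiplication. The short sketch below reproduces them, using the number of teams (13) reported in Sect. 6.1; the constant names are illustrative only.

```python
TEAMS = 13              # groups of three participants (Sect. 6.1)
TRIALS_PER_TEAM = 6     # three roles x two marker conditions
RESULTS_PER_TRIAL = 2   # one answer each from the actor and the observer
TRIAL_LIMIT_MIN = 5     # time limit per trial in minutes

results_per_team = TRIALS_PER_TEAM * RESULTS_PER_TRIAL
total_instances = TEAMS * results_per_team
max_session_minutes = TRIALS_PER_TEAM * TRIAL_LIMIT_MIN

print(results_per_team, total_instances, max_session_minutes)  # 12 156 30
```

Note that 30 minutes is an upper bound on the pure task time, since a trial can end before its five-minute limit.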


6.1 Participants 


Our task was completed by 13 groups, each consisting of three participants, with 6
trial rounds per group. In total there were 39 participants, of whom 17 were female; the average
age of participants was 32. All participants had normal vision, or their vision was
corrected to normal with glasses or contact lenses. Each of the participants gave 
informed consent to take part in the study and received a natural gift (a bottle of 
water or a piece of fruit) after completing the experiment. Each team member of the 
winning team was given a bouquet of flowers as a gift. 


6.2 Apparatus and Stimuli 


Stimuli were presented on three laptops simultaneously, each with a normal RGB 
background on a 14.1-inch screen at a resolution of 1377 x 768 pixels with 60 Hz 
refresh rate. In our evaluation, we used an individually implemented analytics pipeline
consisting of a survey tool and a pre-processing tool. The survey tool handled the
task described in the previous section. It was implemented using PHP and AngularJS
and had an integrated MySQL relational database. The chat environment was based
on Socket.io. The presented geographical map was implemented with D3 and TopoJSON
in a similar way to the tutorial by Mike Bostock. Additionally, we used
the Natural Earth dataset together with GDAL to create each of the maps, which included
country polygons and populated cities within each country. The pre-processing tool set up a
database for survey data and transformed this data into a dataframe, which could then
be directly evaluated using statistical analysis tools such as IBM SPSS Statistics.

The task user interface: The interface of the team workspace consists of a shared
geographical map, a chat system, information on the current reward and remaining
time and an area to apply participant interactions, with options such as “add a
description” or “select target location” (see also Fig. 3). The describer, whose task it is to
describe the target location, views the same geographical map as the other participants,
but additionally sees one of the random cities highlighted. The describer can use
two forms of participant interaction. They can describe the target location freely or
use pre-defined hint buttons, such as short messages indicating “cold” (far away) or
“hot” (close) within the chat system. The maximal message length for communication
is 67 characters, comparable to the length of an SMS. Both interactions of the
describer are reward related. A short hint (“cold” or “hot”) corresponds to the SS-option and
reduces the reward by 10 points; sending a longer text message is the LL-option,
which reduces the reward by 50 points.

Fig. 2 Paper prototype of our map task: The paper prototype of the map task containing a simplified
representation of France including Paris as a reference point (popular city) and four non-reference
points (random cities), which are referred to via symbols. In preparation of the three team members,
each participant of a team was positioned randomly around the map according to the roles described
at the edges of the paper

Fig. 3 Target location task interface: The user interface has a shared geographical map of a country
like Germany containing a set of well-known cities (e.g. Bonn) and hidden cities (marked with
symbols e.g. b). In the user interface windows on the left, those of the describer, town “A” is marked
red, this is the city they have to describe the location of. The user interface windows on the right
show that the actor sees the same cities but marked with different symbols. The describer needs to
refer to the city of Bonn and indicate with a message that the target location is “south of Bonn”.
The actor can respond to this message by moving the marker (shown here as a pin above the map)
or by replying to the message or selecting a reply option for a predefined answer set

The concept of collaboration: All participants are directly updated about com- 
munication methods used as they can all see the remaining reward amount. The actor 
can read the describer’s messages and thus move the marker towards the potential 
target location. This marker is visible to all participants but can only be moved by 
the actor. Via the marker the actor can indicate where they assume the target location 
to be based on the describer’s messages. The actor can also comment on any given 
explanation of the describer freely, without any costs regarding the reward. The actor 
has two options to complete the task: (a) by selecting the target location they assume 
to be correct (LL-option—select 1 city of 10 options in complexity level 1 or 25 
options in complexity level 2) or (b) by selecting a subset of target locations, one 
of which should be the target location (SS-option—i.e. select 1 of 5 options, while 
each option represents in complexity level 1 two cities or in complexity level 2 five 
cities). Both of these options are also reward related: the LL-option results in a fur- 
ther reduction of the reward by 50 points, whereas the SS-option results in a 10 point 
reduction of the reward. In SS-options, cities are clustered based on their proximity 
by using same-size k-means clustering. 
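
The source does not specify how the same-size clustering was implemented. As an illustration only, the following sketch shows one simple greedy variant of k-means with a capacity limit per cluster; the function name, the coordinates and the initialisation are assumptions, not the procedure used in the study.

```python
import math
import random

def same_size_clusters(points, k, iterations=20, seed=0):
    """Split `points` (a list of coordinate tuples) into k equally sized groups.

    Greedy k-means variant with a capacity limit per cluster (a sketch).
    """
    n = len(points)
    if n % k != 0:
        raise ValueError("len(points) must be divisible by k")
    capacity = n // k
    dims = len(points[0])
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids: k random points
    labels = None
    for _ in range(iterations):
        # Rank every (point, centroid) pair by distance, closest first.
        pairs = sorted(
            (math.dist(p, c), i, j)
            for i, p in enumerate(points)
            for j, c in enumerate(centroids)
        )
        new_labels = [None] * n
        sizes = [0] * k
        for _, i, j in pairs:
            # Give each point to the closest centroid that still has room.
            if new_labels[i] is None and sizes[j] < capacity:
                new_labels[i] = j
                sizes[j] += 1
        if new_labels == labels:
            break  # assignment is stable
        labels = new_labels
        # Move each centroid to the mean of its members.
        centroids = [
            tuple(
                sum(points[i][d] for i in range(n) if labels[i] == j) / capacity
                for d in range(dims)
            )
            for j in range(k)
        ]
    return labels

# Hypothetical map coordinates for the 10 non-reference cities of complexity level 1.
cities = [(0, 0), (0, 1), (5, 5), (5, 6), (9, 0), (9, 1), (0, 9), (1, 9), (4, 0), (4, 1)]
labels = same_size_clusters(cities, k=5)
print(labels)
```

With 10 cities and k = 5 every answer subset contains two cities; with 25 cities it contains five, matching the subset sizes given for the two complexity levels.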


6.3 Procedure


Preparation: Before starting the task, a moderator explained the task and the partic- 
ipants of a team completed an initial test-round together by using a simplified paper 
prototype (see also Fig. 2). In addition, the participants had to evaluate how familiar they
were with popular cognitive representation models. The moderator also informed the 
participants about the basic notion of these cognitive representation models by means 
of an example. The three participants carried out the task in separate rooms, each pro- 
vided with a laptop. Communication among participants was only allowed within 
the provided chat room. Participants had to deposit their smartphones outside the 
room and the provided laptops had no internet access. After an introduction to the 
task by a moderator, each participant was left alone in their respective room with 
the laptop and the task without any further discussion with the other team members. 
Each participant had to sign in to a shared workspace where they were then randomly
assigned a role (describer, actor or observer). 

The location identification task: Within the shared workspace each participant
within a team sees the same map of a country. This country map contains a set
of reference points (popular cities of a country, e.g. Berlin, München, Hamburg,
Frankfurt or Stuttgart for Germany) and a set of non-reference points. Non-reference
points are cities which are labelled by a personally chosen random symbol (like a₁ or
a₂). The team of participants begins the task together with a time limit of 5 minutes.
The aim of the task is for the actor and the observer separately to identify the correct 
city as efficiently as possible, based on the hints given by the describer. The reward 
for achieving this task is 1000 points at the start of the game; this reward decreases as 
time passes and with increasing participant interactions. The team with the highest 
score after successfully completing the task wins. Unsuccessful teams who do not 
manage to complete the task end the game with 0 points. Additionally, the task ends 
earlier if the reward is reduced to 0 points based on the team’s amount of participant 
interactions. After starting the task, the remaining time is displayed in the shared 
space, along with the current reward, which decreases at a rate of 1 point per second. 
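
The reward rules can be summarised in a small sketch. It uses the starting reward, the decay rate and the time limit given here, together with the action costs reported in Sect. 6.2. The function and dictionary names are hypothetical, and it is an assumption that every paid action simply subtracts its cost regardless of who performs it or whether an answer is correct.

```python
START_REWARD = 1000    # points at the start of a trial
DECAY_PER_SECOND = 1   # points lost per elapsed second
TIME_LIMIT_S = 300     # five-minute time limit

# Action costs as reported in Sect. 6.2; the key names are illustrative.
COSTS = {
    "short_hint": 10,     # describer: predefined "hot"/"cold" hint
    "long_message": 50,   # describer: free text description
    "subset_answer": 10,  # actor/observer: select a subset of cities
    "single_answer": 50,  # actor/observer: select exactly one city
}

def current_reward(elapsed_s, actions):
    """Return the team reward after `elapsed_s` seconds and the given paid actions."""
    elapsed_s = min(elapsed_s, TIME_LIMIT_S)
    reward = START_REWARD - DECAY_PER_SECOND * elapsed_s - sum(COSTS[a] for a in actions)
    return max(reward, 0)  # the task ends once the reward reaches 0

# Hypothetical trial: one long description, two short hints, one single-city answer.
print(current_reward(90, ["long_message", "short_hint", "short_hint", "single_answer"]))  # 790
```

The sketch does not model the rule that teams who fail to complete the task end with 0 points; that would require tracking task completion separately.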

Finishing the task: Once the actor has selected the correct target location,
all participants are informed via the chat that
the actor has finished; the location is, however, not visible to the other participants. If
the actor has completed the task first, they should help the observer to also find the
correct target location. Hence, an actor who has finished stays in the running game
session and can write messages to the describer and move the flag. Giving answers
is no longer possible. Observers are by nature not allowed to interact with the
shared space, but they can see everything which is happening. They can use additional 
user interactions of actor and describer to identify the correct location. Unlike the 
actor, the observer cannot interact and send chat messages or move the marker. The 
observer is truly just an observer who can see what the other participants are doing. 
In addition, the observer can submit an answer to finish their task. When the observer and the
actor select the same target location, they are rewarded with the same number of
points. If the observer completes the task by correctly identifying the target location,
they are not able to give hints afterwards and can only continue to observe the other
participants’ behaviour. When both the actor and observer have completed the task,
the team is rewarded with the current reward visible to them on their screens.

The marker condition: When only one participant has been able to complete 
the task the reward amount continues to decrease until the maximal task duration 
has been reached. The task is carried out under two conditions: either the marker
is visible to the describer or not. Under the “no marker” condition, the marker is still
visible to the actor and the observer (and can be moved by the actor). Therefore, the
marker is then only helpful for the participants in these two roles.


6.4 Design 


Each participant of a team is randomly assigned a defined role (describer D, actor
A or observer O). In each role the task is applied either with the use of a marker
(M) or without a marker (NM). This combination of three different roles and two
different marker conditions results in a total of 6 trials per team. For example, the
following trial order could be applied: (1) O – NM; (2) D – M; (3) A – NM; (4)
A – M; (5) O – M; (6) D – NM. Each gameplay consisting of 6 rounds is based
on geographical maps of the same complexity level. Whether a gameplay is based 
on complexity level one or two is assigned randomly. 
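
As an illustration, the design of one gameplay can be sketched as the shuffled cross product of roles and marker conditions plus a randomly drawn complexity level. The function name and the use of a seed are assumptions for reproducibility and are not described in the source.

```python
import itertools
import random

ROLES = ("D", "A", "O")           # describer, actor, observer
MARKER_CONDITIONS = ("M", "NM")   # with marker, without marker

def draw_gameplay(seed=None):
    """Return a complexity level and a random order of the six role/marker trials."""
    rng = random.Random(seed)
    trials = list(itertools.product(ROLES, MARKER_CONDITIONS))
    rng.shuffle(trials)
    complexity_level = rng.choice((1, 2))  # one level for all six rounds
    return complexity_level, trials

level, trials = draw_gameplay(seed=1)
for number, (role, marker) in enumerate(trials, start=1):
    print(f"({number}) {role} - {marker}, complexity level {level}")
```

Each of the six role and marker combinations occurs exactly once per gameplay; only the order and the complexity level vary between teams.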

A new geographical map was generated for each trial from a set of 6 countries.
Each country contained more than 100 possible cities.² For each country, only ten
cities were selected as candidate reference points (popular cities); the rest were
categorised as potential non-reference points (random cities), which were also randomly
selected.
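
The map generation can be sketched as random sampling from the two city pools. The numbers of reference and non-reference points per complexity level are taken from Sect. 5. The function name, the placeholder city names and the uniformly random choice of the target among the non-reference points are assumptions.

```python
import random

# Complexity level -> (number of reference points, number of non-reference points)
COMPLEXITY_LEVELS = {1: (5, 10), 2: (1, 25)}

def draw_map_cities(popular, others, level, seed=None):
    """Sample the cities shown on one map and pick a target among the random cities."""
    rng = random.Random(seed)
    n_reference, n_non_reference = COMPLEXITY_LEVELS[level]
    reference = rng.sample(popular, n_reference)        # shown with their names
    non_reference = rng.sample(others, n_non_reference)  # shown as symbols only
    target = rng.choice(non_reference)
    return reference, non_reference, target

# Placeholder pools: ten candidate popular cities and the remaining cities of a country.
popular_cities = [f"popular_{i}" for i in range(10)]
other_cities = [f"city_{i}" for i in range(146)]

reference, non_reference, target = draw_map_cities(popular_cities, other_cities, level=2, seed=3)
print(len(reference), len(non_reference), target in non_reference)  # 1 25 True
```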


7 Results 


Our initial focus was the use of the cognitive representation model. To confirm the
suitability of the geographical map as a preferential cognitive representation model
we asked participants to assess their usage of four cognitive representation model
options (a geographical map, a biplot, conceptual space and a semantic network)
in their daily life. The results, based on a 7-point Likert scale (from (1) “I don’t
know what this is” to (7) “I am using it regularly in my daily life”), are shown in
Fig. 4. The results reveal that the conceptual space and the biplot were the least
known representation models. 23 of 39 participants did not know what conceptual
space was or had never used it. Similarly, biplots had never been used by 24 of 39
participants. In contrast, semantic networks were better known and used by a
larger proportion of participants. Of 39 participants, 24 stated that they used semantic
networks; here, responses ranged from “I have used it sometimes, but some time ago”
to “I use it, but not regularly in my daily life”. Nevertheless, the geographical map
was evaluated as the most widely used cognitive representation model. 31 of 39
participants confirmed that they used geographical maps, even though not regularly
in their daily lives. Based on these results we consider the geographical map a
suitable cognitive representation model for our study purposes, as it can easily be
used by a broad range of participants.

² Mexico incl. 1190 cities, Norway incl. 417 cities, Philippines incl. 156 cities, Puerto Rico incl.
242 cities, Portugal incl. 286 cities, Ukraine incl. 510 cities.

Fig. 4 Intensity of usage: 39 participants evaluate how intensively they use four types of cognitive
representation model: a geographical map, biplot, conceptual space and semantic network. The
diagram shows how often the selected options (ranging from “I don’t know what this is” to “I am
using it regularly in my daily life.”) were chosen for each cognitive representation model

We also wanted to evaluate whether the complexity of a geographical map influ- 
ences the level of interactivity used to complete the task. Based on the principle of 
least collaborative effort, interaction itself contains the application of linguistic con- 
straint tools which are used more intensely in complex situations. We hypothesised 
that if the complexity of a cognitive representation model is too low, then no team 
interaction emerges. To investigate this, we compared the two levels of map complex- 
ity. Complexity should influence all levels of interactivity, which describe the nature 
of a task round. As such we evaluated several indicators of interactivity: how often 
answers were given, the number of messages the describer sent and how often the 
actor responded to a message or moved the marker. Table 2 lists these indicators of 
interactivity. Using the complex geographical map, the describer had to send consid- 
erably more long messages (53.3%) and short messages (60.0%) than under simpler 
map complexity conditions. Where initial descriptions could not narrow down the 
target location enough, multiple long messages were required. Short messages used

Table 2 The relationship between complexity and interactivity: Several indicators of team
interaction are compared with two levels of complexity. Complexity level 1 (2) contains 5 (1)
reference points and 10 (25) non-reference points. Complexity level 2 compared with level 1
requires a much higher degree of interactivity

                                                   Complexity level
Interactivity indicator                            1 (5-10)    2 (1-25)    p-value

Describer
  Sends two or more long messages                  29.6%       53.3%       0.006
  Gives short feedback (e.g. “hot” or “cold”)      12.5%       60.0%       0.000
Actor
  Moves the marker at least three times            11.1%       37.8%       0.000
  Sends two or more comments                       11.1%       44.4%       0.000
Actor and observer
  Answers more than once (single-answer-option)    0.0%        24.4%       0.000
  Answers using the answer set                     0.0%        34.4%       0.000


by describers under these conditions tended to be small hints relating to previous
interactions with the actor. Under complex map conditions (complexity level 2) the
actor used the option of moving the marker more often (37.8% of actors moved it at
least three times, compared with 11.1% under the less complex map condition) and
responded to the describer more frequently: 44.4% of all actors gave feedback with at
least two comments. Under map complexity level 1, participants selected the correct
target city, rather than a subset of target cities. Comparing the map complexity
levels, there was a significant difference in the number of long messages sent by
describers (p < 0.01), responses by actors (p < 0.01), and movements of the marker by
actors (p < 0.01). Overall, there is a significant difference in the degree
of team interaction required to complete the task between complexity level 1 and
level 2. Figure 5 visualizes these differences based on two session examples. The
results show that, in contrast to complexity level 2, complexity level 1 requires very
little team interaction to solve the task. Only when complexity increases does the
level of interaction become more intense.
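
The text does not state which test produced the p-values in Table 2. As a minimal sketch of how two such proportions could be compared, the following uses a two-sided Fisher exact test on a 2 x 2 table. The choice of test, the function name and the session counts are all assumptions made for illustration; the counts are invented and are not the study's data.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Probability of x "yes" sessions in row 1 with all margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    # Sum every table that is at most as likely as the observed one.
    return sum(
        prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9)
    )


# Hypothetical counts: sessions with / without two or more long messages.
level_1 = (8, 19)   # complexity level 1 (invented counts)
level_2 = (24, 21)  # complexity level 2 (invented counts)
p_value = fisher_exact_two_sided(*level_1, *level_2)
print(f"two-sided p = {p_value:.4f}")
```

An exact test is used here only because it needs no large-sample approximation; a chi-square test would be an equally plausible choice.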

We also assessed the influence of the marker and the degree of interactivity on 
communicative success. From our previous observation we conclude that complex- 
ity (as a control parameter) influences the nature of interactivity. Hence, we focused 
on trials with map complexity level 2, where the participants appeared to be under 
higher pressure to interact. We used our results to evaluate the interactivity hypothesis,
initially observed by Zubek (2016): When a pair of participants try to identify
a target, they perform better than when one participant attempts this task alone. We
reformulate this statement for our study: When a pair of participants interact inten- 
sively, their performance (in terms of task completion) is better than when interaction 
is very limited.

Two degrees of interactivity

Example of a non-interactive session:

1. describer: “erste Stadt oben links (westlicher Rand)” (“first city top left
   (western edge)”)
2. actor: got the correct answer

Example of an interactive session:

1. describer: “süd-östlicher ort” (“south-eastern place”)
2. actor: “eher nah bei Shiraz oder weiter davon entfernt” (“close to Shiraz or
   further away”)
3. describer: “nicht direkt neben shiraz. der entfernteste ort süd-östlich davon”
   (“not directly next to Shiraz. The furthest place south-east from there”)
4. actor: got correct answer

Fig. 5 Two degrees of interaction: In the first example, the describer sends one message and the actor
was able to identify the correct target based solely on this description. Contrastingly in the second
example further interaction and refinements are required to select the correct target. It becomes
apparent that interaction only emerges when an initial linguistic constraint tool (the message sent
by the describer) is not sufficient to reach semantic co-creation (to complete the task)


While we focused on cases where map complexity level 2 was used, interactivity
was measured using only the most widely distributed variables, that is, those with
the greatest diversity of observed values. In our results, the number of comments
made by the actor and the number of movements of the marker
were the measures of interaction with the greatest variability. We compared these
measures with indicators of communicative success. The main indicator of
communicative success is how many participants successfully completed the task. Here,
the answer can be two (the actor and observer), one (either the actor or the observer)
or none (neither the actor nor the observer). Further indicators of communicative
success are the times taken by the first and by the second participant in
each team to complete the task. The results for these indicators are summarized in
Table 3. Based on these results, the hypothesis that participant pairs perform better
when interacting more must be rejected. Of all teams requiring no comments to finish a
task, 86.5% were successful. In contrast, of all teams that required two or more
comments, only 40.0% were successful. This observation is reinforced when we look at
the time at which the first participant (be it actor or observer) finished the task.
Of the teams that required no comment, 91.7% were in the fastest half of first
completion times, compared with 32.5% of the teams that required two or more comments.
The same is true when we look at the number of marker movements: of the teams that did
not move the marker, 85.0% were in the fastest half of first completion times,
compared with 29.4% of the teams that moved it three or more times. In summary, the
best-performing part of a team performs better if its members do not interact
intensely. The best-performing part here means two of the three participants,
including the one identifying participant who was
able to complete the task faster. Here, no benefit of interaction on communicative

Table 3 The relationship between participant interactions and communicative success: Two
indicators of communicative success are compared with two measures of the degree of
interaction. It can be summarized that the participants were most successful if they did not
interact. However, it has become obvious that interaction is helpful in a way that all
participants finish the task successfully

                                      Number of comments used       Number of marker movements
Indicator of communicative success    0        2 or more  p-value   0        3 or more  p-value
(degree of ...)

All participants successfully         86.5%    40.0%      0.004     75.0%    44.1%      0.119
completing the task
Fastest 50% of teams                  12.5%    60.0%      0.000     15.0%    61.8%      0.001
completing the task
Fastest 50% of participants           91.7%    32.5%      0.000     85.0%    29.4%      0.000
completing the task


success can be observed. Nevertheless, the results also make clear that full
completion of the task was improved by a high degree of interaction. Of the teams that
required no comment to finish a task, only 12.5% were in the fastest half for full
completion of the task, compared with 60.0% of the teams that required two or more
comments. Similarly, of the teams that did not move the marker, 15.0% were in the
fastest half for full completion of the task, compared with 61.8% of the teams that
moved it three or more times.

All these observations are significant at a level of at least p < 0.01. It should
be noted that the number of marker movements did not correlate with the number of
participants successfully completing the task (p = 0.119).
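
The "fastest half" indicator used above can be read as a median split over completion times. The following is a minimal sketch of that reading, assuming the split is made at the median of all sessions pooled together; the function name, the group labels and the times (in seconds) are invented for illustration and are not taken from the study.

```python
from statistics import median


def share_in_fastest_half(times_by_group):
    """Per group, the share of sessions at or below the pooled median time."""
    all_times = [t for times in times_by_group.values() for t in times]
    cutoff = median(all_times)
    return {
        group: sum(t <= cutoff for t in times) / len(times)
        for group, times in times_by_group.items()
    }


# Hypothetical first completion times in seconds.
sessions = {
    "no_comments": [12.0, 15.0, 18.0, 40.0],
    "two_or_more_comments": [35.0, 42.0, 50.0, 16.0],
}
shares = share_in_fastest_half(sessions)
print(shares)
```

With these invented times the pooled median is 26.5 s, so three of the four sessions without comments fall into the fastest half.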

Our evaluation also considered the impact of a marker as a linguistic constraint
tool on communicative success. Adapting the basic notion of Zubek et al. (2016),
we hypothesised that the use of a marker improves team interaction
because it focuses communication on critical aspects. More concretely, teams using a
marker in intensive interaction should attain the highest communicative success.
Based on our results, this hypothesis should be rejected. We compared sessions with
a low and a high level of interactivity under both marker conditions. While
the map complexity is only a control variable set from outside, the interactivity
level is a phenomenon which is inherently part of team communication. The
comparison uses the same variables as the analysis reported above. Here, however,
we are only interested in the share of actors who moved the
marker or wrote a comment in response to a given description. We observe that having
a marker is generally a disadvantage. Of all users who moved the marker at least
once, 78.6% (non-interactive sessions) and 70.0% (interactive sessions) were
successful in finishing the task when the marker was not visible to the describer. In
contrast, when the marker was visible to the describer, only 61.5% (non-interactive
sessions) and 40.0% (interactive sessions) finished the task successfully. The same is
true for the first completion time. When the marker was not visible to the describer,
91.7% (non-interactive sessions) and 85.0% (interactive sessions) were in the fastest
half of first completion times. In contrast, when the marker was visible to the
describer, only 32.5% (non-interactive sessions) and 29.4% (interactive sessions) were
in this fastest half. A similar pattern occurs when we evaluate those sessions in
which the actor wrote at least one comment. Here, we see a disadvantage
when teams were heavily interactive and used the marker. The share of users
who finished the task after writing at least one comment ranges from 75.0% for
non-interactive teams without a marker to 23.1% for interactive teams with a marker.
Furthermore, 80.0% of the non-interactive teams without a marker were in the fastest
half of first completion times, whereas only 30.8% of the interactive teams that used
a marker and wrote at least one comment were in this half. With reference to marker
movement only, we can further observe that interactivity helped the slowest identifier
(regardless of whether this was the actor or the observer) to finish the task.
Interactive teams in which the actor moved the marker at least once were in the
fastest half of full completion times in 70.0% (no-marker condition) and 63.3% (marker
condition) of all cases. In contrast, only 24.1% (no-marker condition) and 30.8%
(marker condition) of the non-interactive teams were in the fastest half of full
completion times.

In sessions with intense interaction in which the marker was visible to the describer
and all other participants, few teams completed the task successfully (23.1%) and
few subteams were in the fastest half of first completion times (30.8%). In
contrast, teams with a low degree of interactivity and no marker shared with the
describer achieved the best communicative success. Under this condition, 75.0% of
all users finished the task successfully, and 80.0% were in the fastest half of
first completion times. It can be concluded that having a marker as
a linguistic constraint, in particular, is a disadvantage for achieving communicative
success. It should be noted that our first observation does not reach the required
significance level of p < 0.05 (p = 0.051, Table 4). Nevertheless, we decided to
include this observation in our considerations because the significance level reached
is very close to the required level and the observation supports the overall picture.
A further observation of the full completion time based on comments is not considered
because it is not significant (Table 4).
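
The 2 x 2 comparison of interactivity level and marker condition can be pictured as grouping sessions into four cells and computing a success rate per cell. The sketch below assumes each session has already been labelled as interactive or not; the text does not give the threshold used for that classification, so the field names and the example sessions are hypothetical.

```python
from collections import defaultdict


def success_rate_by_cell(sessions):
    """Success rate for each (interactive, marker_visible) combination."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for s in sessions:
        cell = (s["interactive"], s["marker_visible"])
        totals[cell] += 1
        successes[cell] += s["all_completed"]  # True counts as 1
    return {cell: successes[cell] / totals[cell] for cell in totals}


# Invented sessions, not the study's data.
example_sessions = [
    {"interactive": False, "marker_visible": False, "all_completed": True},
    {"interactive": False, "marker_visible": False, "all_completed": True},
    {"interactive": True, "marker_visible": True, "all_completed": False},
    {"interactive": True, "marker_visible": True, "all_completed": True},
]
rates = success_rate_by_cell(example_sessions)
print(rates)
```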


8 Discussion and Conclusion 


The aim of the contribution model of conversation is to reach the grounding criterion
with the least collaborative effort, which is fundamental to achieving communicative
success based on semantic co-creation (Clark and Schaefer 1989). Linguistic constraint
tools play a critical role in reaching the grounding criterion within interactions.
Hanna and Brennan (2007) already pointed out the importance of constraints even
outside of discussion. Nevertheless, the constraint model they applied focuses
on text processing and cannot explain the adaptive nature of these constraints.
The notion of linguistic constraints (Rączaszek-Leonardi and Kelso 2008;
Pattee 1997) overcomes this limitation based on clear epistemological roots. Previous

Table 4 The relationship between interaction and the marker condition on communicative success:
The independent and dependent variables are compared separately, based on the interaction level
and the marker condition

Conditions (2 x 2 setting)
Interactivity level                   Few interactions          Substantial interaction
Marker condition                      Not present   Present     Not present   Present      p-value

Indicators of communicative success

Degree of ... actors that moved the marker once or more often
  All participants successfully       78.6%         61.5%       70.0%         40.0%        0.051
  completing the task
  Fastest 50% of teams                24.1%         30.8%       70.0%         63.3%        0.003
  completing the task
  Fastest 50% of participants         91.7%         32.5%       85.0%         29.4%        0.000
  completing the task

Degree of ... actors that wrote 1 or more comments
  All participants successfully       75.0%         73.3%       71.4%         23.1%        0.000
  completing the task
  Fastest 50% of teams                35.0%         43.3%       71.4%         53.8%        0.171
  completing the task


research has investigated several linguistic constraint tools, such as an ontology (Zubek
et al. 2016), contextual restrictions (Müller et al. 2013), a cognitive representation
model (Keilmann et al. 2017) or a shared marker (Hanna and Brennan 2007). Linguistic
constraint tools can help to improve team interaction and make conversation more
focused (Zubek et al. 2016). What previous studies have in common is that they do not
make the use of a shared marker optional: users have to move a marker to finish the
given task successfully. From our point of view, such a setting is unfair because it
prevents the natural competition between the shared marker and discussion as two
linguistic constraint tools applied in parallel.

Previously, the advantage of team focused interaction was observed in complex
study situations (e.g. identifying the correct wine through smelling and tasting based
on a collection of available wines). Some referring expression tasks using a shared
marker require a total task duration of less than 20 s. We therefore asked whether
team focused interaction brings a general advantage, independent of the given decision
complexity. Our study design compares the availability of a shared marker and two
levels of decision complexity. Controlled by the number of cities in a map, we
simulated two situations in which the perceived decision complexity differed. As
stimuli we used a shared geographical map that contained a number of cities, of which
only a very small subset could be used as reference points. Furthermore, we presented
the principle of least collaborative effort to the participants as a collaborative
delay discounting problem (Scherbaum et al. 2016). The participants are not only
aware, moment by moment, of how far or close they are from a state of semantic
co-creation (Brennan 2005); the value perspective is also visible to them moment by
moment. The teams can thus continuously evaluate the major characteristics of the
contribution model of conversation. We classified sessions into a low and a high level
of interactivity and compared them according to whether or not a marker was shared
with the describer. In our setup, each team consists of a triad of participants
(describer, actor and observer). This team configuration allows us to design a natural
way of having a shared marker or not. If a marker is not shared with the describer, it
is still useful because it helps the actor and the observer.

We observed significant differences in the degree of team interactivity under
different complexity conditions. At the lower complexity level, participants were able
to use simple descriptive messages, which were sufficient for accurate target
identification. However, it is not enough to provide a channel for team interaction.
Up to now, the team focused interaction hypothesis has assumed that if two people
can communicate, identification accuracy improves in general. The observations
in this study show that while team interaction was possible in principle, it
did not really happen: one description was sent and the identification followed
immediately in the next step. This pattern of discussion was already observed by
Neider et al. (2010). It has thus become obvious that team interaction by
discussion requires a specific level of perceived decision complexity to be
an advantage. If the decision complexity is too low, real team interaction is not
required.

Considerable amounts of interaction were only observed under higher complexity
conditions. Nevertheless, we observed that participants who did not interact heavily
achieved the best performance in accuracy and task duration. That means that team
interaction becomes an appropriate tool only when perceived decision complexity
reaches a specific level. This can be interpreted to mean that team interaction is
itself a kind of linguistic constraint tool, which is subject to the principle of
least collaborative effort. Still, we could also observe that participants interacted
more intensely when they had problems identifying the target. Our results have shown
that, at a given level of complexity, a shared marker becomes useful for resolving
conflict situations. In a similar fashion, Brennan et al. (2005) already observed that
if a shared marker was not present, situations evolve in which an actor stays close to
the target for a long time and still does not identify it correctly. From our
perspective, a conflict situation relates to an increased perceived decision
complexity, which is why a marker seems to be advantageous.

In general, our results for complexity level 2 show that using a shared marker
was disadvantageous for the teams. Teams that shared a marker with the describer
required more time to complete the task and were less accurate in identifying the
correct target. In contrast, teams with a low degree of interactivity and no marker
shared with the describer achieved the best performance. It should be noted that,
similar to our previous observation, the second identifier was faster
when the team interacted heavily. From this observation we infer that the marker (as
a linguistic constraint tool) is not useful at the given level of complexity.
Identifying the level of complexity at which it becomes useful was, however, not the
aim of this study.

To sum up, these observations confirm the principle of least collaborative effort
with respect to the team focused interaction hypothesis. Team interaction and focused
interaction are two separate linguistic constraint tools with different
characteristics, and hence need to be evaluated separately. Perceived decision
complexity plays a crucial role in determining whether team interaction becomes
beneficial. Only when situations become complex does team interaction become more and
more helpful. Using a linguistic constraint tool (for example a marker) on top of a
cognitive representation model requires an even higher level of complexity than the
level at which team interaction is already helpful. It might be the case that, if a
situation is perceived as complex, we first add teams and let them interact freely.
If this is not enough, we then try to make the interaction more focused by using
additional linguistic constraint tools.

Our observations are subject to three limitations. First, the study was based on
the use of a geographical map as a cognitive representation model. It is not clear 
whether these results can be replicated using other cognitive representation models. 
The characteristics of other cognitive representation models could also lead to other 
limitations regarding linguistic constraint tools. Additionally, using a geographical 
map implies a scenario in which spatial language is required to achieve a state of 
semantic co-creation (Spranger 2016). In other scenarios, for example searching
for a target within text documents, other linguistic practices are
required. For each scenario, appropriate linguistic constraint tools should
be applied. Secondly, only a subset of linguistic constraints was evaluated. If we
had only been evaluating the use of a shared marker within a shared space, we could 
have distinguished between different types of use, e.g. shared-eye-tracking or shared- 
mouse-tracking (e.g. Müller et al. 2013). A comparison of the suitability of several 
possible linguistic constraint tools was however not part of this study. Finally, the
results show that the team focused interaction hypothesis does not hold in general. A
low decision complexity leads to an outstanding benefit team interaction and team 
focused interaction as well. The question remains open whether we generally have to
reject the team focused interaction hypothesis when we use shared cognitive
representation models, or whether a sufficiently complex situation is required for the
hypothesis to hold. Moreover, future research should include the use of more complex
geographical maps, containing for example 1,000 to 10,000 random locations, to further
evaluate the validity of this hypothesis.

Based on our results, we recommend that further studies focus on complexity 
when evaluating a specific cognitive representation model. The assessment of tools 
which are appropriate for implementation in search tasks in cognitive representation 
models should recommend simple filtering tools when the complexity is low. In 
contrast, where complexity is high, more sophisticated filtering tools are required to
successfully complete search tasks. Using sophisticated filtering tools in complex 
environments leads to new situations with lower complexity where simple filters 
are preferred. Simple and sophisticated linguistic constraint tools can comprise any 
form of linguistic constraint. In our study we focused on investigating the impact of 
these linguistic constraints on conversation with respect to the aim of reaching the 
grounding criterion using the least collaborative effort.


1 Introduction


Knowledge of object use is one of the most important available types of knowledge 
for a living being. For instance, humans can make use of a hammer to nail wooden 
planks and build a house, chimpanzees can use a twig to “fish” for insects, and birds 
of prey called bearded vultures, or lammergeiers, can make use of stones to break 
bones and feed themselves with marrow. 

A basic issue in human cognition is how information concerning actions with 
objects is represented. Are motor representations critical components of object 
concepts? This question taps into the ongoing debate on the format (i.e., neural 
substrate, patterns of activation) of conceptual representations (for an overview see 
Scerrati 2017; Scerrati et al. 2017). This debate critically involves two of the three
main research questions outlined in the present volume, that is, how concepts are
acquired and how they are used in cognitive tasks. The current research is a
psychological investigation, which attempts to address these questions and, specif- 
ically, how concept learning and representation interact with the development of 
motor abilities. 

An increasingly widespread view assumes that knowledge is grounded in sensory-motor
experiences (Barsalou 1999, 2008, 2016; Gallese and Lakoff 2005; Glenberg
and Kaschak 2002, 2003; Glenberg and Robertson 2000; Pulvermüller 1999,
2001; Zwaan 2004). The semantic analysis reported in Vernillo (Chap. 8) demonstrated
that the literal meaning of action verbs poses constraints on their usage in
metaphorical sentences. Neuropsychological research provides further support for
the grounding assumption by showing the existence of selective impairments at the
expense of specific categories of information. For example, following a stroke,
a viral infection or a neurodegenerative disease, such as Alzheimer’s disease
(AD) or Semantic Dementia (SD), people may selectively lose knowledge of living
animate (i.e., animals) or inanimate (i.e., fruit/vegetables) entities, conspecifics (i.e.,
other people) or non-living things (i.e., manipulable artefacts). According to the
sensory/functional theory (Warrington and McCarthy 1983, 1987; Warrington and
Shallice 1984; see also Damasio 1989; Farah and McClelland 1991; Humphreys and
Forde 2001; McRae and Cree 2002), category-specific deficits can be explained
by assuming that knowledge of a specific category is located near the sensory and 
motor areas of the brain dedicated to perception of its instances’ perceptual quali- 
ties and kind of movements. Therefore, when a sensory-motor area is damaged, the 
processing of instances of the specific category that rely on that area is impaired. 
Importantly, neuropsychological research also suggests that sensory-motor represen- 
tations are involved not only in comprehending and producing voluntary movements 
but also in thinking about them (Buxbaum et al. 2000). 

In addition, neuroimaging studies have largely shown different neural activations 
for different categories. For instance, Chao et al. (1999, 2002) found differential 
activation for animals and tools. Furthermore, Chao and Martin (2000) described 
regions in the dorsal visual pathway, such as the posterior parietal cortex, that were 
differentially recruited when participants viewed manipulable objects like tools and 
utensils. Also, semantic knowledge of actions has been shown to involve different 
loci of representation in the brain than semantic knowledge of entities, specifically the 
frontal lobe motor-related areas (see, for example, Hickok 2014; Kemmerer 2015). 
Interestingly, a growing body of neuroimaging research also shows that knowledge of 
object use is automatically activated upon naming (Chao and Martin 2000; Chouinard 
and Goodale 2010), categorizing (Gerlach et al. 2002), and even passively viewing 
manipulable objects (Creem-Regehr et al. 2007; Grèzes et al. 2003; Vingerhoets
2008; Wadsworth and Kana 2011). 

Similarly, several behavioral studies showed that semantic content influences 
reach-to-grasp movement responses. For instance, Gentilucci and Gangitano (1998) 
found that automatic word reading influenced grasping movements: Their subjects 
automatically associated the meaning of the word (“corto: short”, “lungo: long”) 
with the distance to cover in order to perform a grasping action and activated a motor
program for a nearer/farther object position. Glenberg and Kaschak (2002) showed
that judging sensibility of sentences was easier when the movement implied by the 
sentence was in the same direction as the movement required by the response. In 
a similar vein, Zwaan et al. (2002) showed that object verification and naming was 
easier when the object’s shape on display matched the shape implied by a previously 
presented sentence. Furthermore, Glover et al. (2004) demonstrated that reading 
words describing objects activated motor tendencies, which influenced the grasping 
of target blocks. Lindemann et al. (2006) further showed that action semantics acti- 
vation hinges on the specific action intention of an actor. Importantly, Myung et al. 
(2006) showed similar effects of semantics with a lexical decision task that required 
keypress responses: Performance on the target word was better when semantically 
dissimilar prime-target pairs shared manipulation information (e.g., typewriter and 
piano). 

Although much is known about how semantic content mediates action in response
to the environment, the influence of motor activation on semantic processing has not
received as much attention. The present study aimed to fill this gap by focusing
on potential effects of action on language. If, as assumed by the sensory/functional
theory (Warrington and McCarthy 1983, 1987; Warrington and Shallice 1984; see
also Damasio 1989; Farah and McClelland 1991; Humphreys and Forde 2001; McRae
and Cree 2002), conceptual content is stored close to the sensory and motor
systems, and, as claimed by the grounded view, semantics shares a common neural
substrate with the sensory and the motor systems (Barsalou 1999, 2008, 2016), then
effects should be observed bidirectionally, that is, not only from language to action
but also vice versa (see Meteyard and Vigliocco 2008).

The current study is aimed at testing whether: (a) motor information concerning 
objects can be pre-activated through the presentation of images of graspable objects 
as primes (e.g., “frying pan”); and (b) pre-activated motor information concerning 
graspable objects can affect performance on a lexical decision task involving target 
words describing objects’ properties relevant for action (e.g., handle). 

To this end, participants were instructed to observe a prime object that could be 
presented in two different orientations, that is, with the action-relevant component 
(e.g., the frying pan’s handle) oriented either toward the left or toward the right. 
They were then asked to perform a lexical decision task (LDT)—a task commonly 
used in studies on lexical-semantic processing (Meyer and Schvaneveldt 1971; see 
also Iani et al. 2009; Scerrati et al. 2017)—on a subsequent target word. Specifically, 
they were required to judge whether the following target was a known word in the 
Italian lexicon or not by pressing a key either on the same side as the depicted action- 
relevant property of the prime object (i.e., spatially compatible key) or on the opposite 
side (i.e., spatially incompatible key). Target words matching in frequency and length 
were of three different types: words describing properties relevant for action with the 
object (action-relevant words, e.g., handle); words describing properties irrelevant 
for action with the object (action-irrelevant words, e.g., ceramic); words describing 
things unrelated to the object (unrelated words, e.g., eyelash). 
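
The design described above crosses spatial compatibility with word type. As a minimal sketch, compatibility can be coded from the side of the action-relevant component and the side of the response key, which yields six design cells. The function and variable names are hypothetical, and the sketch says nothing about how trials were actually ordered or counterbalanced.

```python
from itertools import product

WORD_TYPES = ("action-relevant", "action-irrelevant", "unrelated")
SIDES = ("left", "right")


def compatibility(component_side, response_side):
    """Code a trial as compatible when the key is on the component's side."""
    if component_side == response_side:
        return "compatible"
    return "incompatible"


# Every combination of component side, response side and word type
# collapses into 2 (compatibility) x 3 (word type) design cells.
design_cells = sorted(
    {(compatibility(c, r), w) for c, r, w in product(SIDES, SIDES, WORD_TYPES)}
)
for cell in design_cells:
    print(cell)
```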

If the image of the graspable object (i.e., the prime image) directly cues a specific 
motor representation, which becomes part of the concept held in working memory
(e.g., Bub and Masson 2010), then we should observe a facilitation on the subsequent
lexical decision task provided that the target word is action-relevant (e.g.,
handle) and the orientation of the action-relevant component of the prime object is 
spatially compatible with the response key. Indeed, several behavioral studies showed 
a facilitation when the responding hand of the participant and the orientation of the 
object’s graspable component, that is, its affordance (e.g., the handle; for the original 
idea of affordance see Gibson 1979) were compatible (i.e., on the same side) rather 
than incompatible (i.e., on opposite sides). This finding supports the assumption that 
seeing a picture of a graspable object activates the motor actions associated with its 
use (Iani et al. 2019; Pellicano et al. 2010; Saccone et al. 2016; Scerrati et al. 2019,
2020; Tipper et al. 2006; Tucker and Ellis 1998; Vainio et al. 2007). Therefore, we 
expect that the presentation of the graspable prime object will pre-activate manipula- 
tion information about objects. This in turn should facilitate a lexical decision task on 
target words describing those objects’ properties relevant for action (e.g., handle). 
In contrast, no such facilitation is expected for target words that describe proper- 
ties irrelevant for action with (action-irrelevant words, e.g., ceramic) or unrelated to 
(unrelated words, e.g., eyelash) the prime object. In other words, we expect that motor 
information evoked by object observation will have different effects depending on 
the type of word that follows. Specifically, we predict that motor information will 
determine a motor-to-semantic priming effect for action-relevant words as the processing 
of these words can benefit from the activation of motor knowledge. Conversely, it 
should determine neither benefits nor disadvantages for action-irrelevant and unre- 
lated words as these words refer to motor-irrelevant features of the prime objects. 
Hence, we expect to observe an interaction between spatial compatibility and the 
type of word. 


2 Method 


2.1 Materials 


The prime stimuli were digital photographs of four domestic objects (can, door, 
frying pan, radiator) selected from public-domain images available on the Internet. 
Prime objects could be presented in two orientations, that is, with the action-relevant 
component (e.g., the frying pan’s handle) oriented either toward the left or toward the 
right. These objects subtended a maximum of 13.7° of visual angle horizontally and 
12.3° of visual angle vertically when viewed from a distance of 60 cm. Prime objects 
were centered on screen according to the length and width of the entire object. 
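
The relation between the reported visual angles and the physical size of the stimuli can be made explicit with the standard visual-angle formula. The following is a minimal sketch, assuming a flat screen viewed head-on at the stated distance of 60 cm; the function names are illustrative and are not taken from the experiment's software.

```python
import math


def visual_angle_deg(size_cm: float, distance_cm: float) -> float:
    """Visual angle (degrees) subtended by a stimulus of a given size."""
    return math.degrees(2 * math.atan(size_cm / (2 * distance_cm)))


def size_cm(angle_deg: float, distance_cm: float) -> float:
    """Physical size (cm) that subtends a given visual angle."""
    return 2 * distance_cm * math.tan(math.radians(angle_deg) / 2)


VIEWING_DISTANCE_CM = 60  # viewing distance reported in the text

# Maximum extent of the prime objects, recovered from the reported angles.
prime_width = size_cm(13.7, VIEWING_DISTANCE_CM)   # roughly 14.4 cm
prime_height = size_cm(12.3, VIEWING_DISTANCE_CM)  # roughly 12.9 cm
print(f"{prime_width:.1f} cm x {prime_height:.1f} cm")
```

Under these assumptions, the prime objects would have measured at most about 14.4 cm by 12.9 cm on screen.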

The target stimuli were twelve words belonging to three different categories: Four 
words referred to a characteristic of the prime object that was relevant for action (e.g., 
handle); four words referred to a characteristic of the prime object that was irrelevant 
for action (e.g., ceramic); four words referred to things unrelated to the prime object 
(e.g., eyelash). For the complete list of stimuli, see Appendix. Target words ranged 
from 2.7 to 5.4 cm (from 5 to 10 characters) which resulted in a visual angle range 
between 2.5° and 5.1° when viewed from a distance of 60 cm. 

Table 1 Psycholinguistic matched variables of the target words used in the main experiment 

          | Action-relevant words      | Action-irrelevant words | Unrelated words 
          | Mean         SD    Range   | Mean   SD     Range     | Mean   SD     Range 
Frequency | 7.7          3.7   3-12    | 21.2   11.5   11-34     | 26     40.7   2-87 
Length    | [illegible]  1.2   6-9     | 8      2.1    5-10      | 7.2    0.9    6-8 

Words from the three categories (action-relevant, action-irrelevant, and unrelated) 
were matched in terms of frequency and length. For lexical frequency, the Italian 
database Colfis was used (Bertinetto et al. 1995). Values for frequency and length of 
target words are reported in Table 1. 

To control for association strength between the prime object and the target word, 
40 Italian participants (23 males; mean age: 28 years old; SD: 9 years) who did not 
participate in the main experiment were asked to rate the twelve target words in 
terms of their degree of association with the prime objects on a 7-point Likert 
scale (1 = “not associated at all”; 7 = “very associated”). The mean ratings were 5.2 
for action-relevant words related to the prime object, 5.4 for action-irrelevant words 
related to the prime object, and 1.5 for words unrelated to the prime object. 

Twelve legal non-word fillers (e.g., celimora) were created using a non-word 
generator for the Italian language available online.¹ The non-words were preceded 
by the same prime objects. 

To control for potential phonological associations between the non-word fillers 
and the target words, 28 new Italian participants (11 males; mean age: 27 years old; 
SD: 7 years) were engaged in a free association production task. The task required 
participants to write down the first two Italian words that each of the twelve non-words 
brought to mind. Only one participant reported the Italian word ciglia (included in 
the unrelated category) in response to the non-word geglie. However, given it was 
an isolated case, we did not consider it necessary to exclude this non-word from our 
selection of non-word fillers. 


3 Participants 


Thirty-four participants (13 males; mean age: 22 years old; SD: 3 years) from the 
University of Modena and Reggio Emilia took part in the experiment. All 
participants were native speakers of Italian, had normal or corrected-to-normal vision, 
and were naive as to the purpose of the experiment. Handedness was measured by the 
Edinburgh Handedness Inventory (Oldfield 1971), which revealed that 25 participants 
were right-handed (laterality mean = 0.76; SD = 0.13), seven participants were 
ambidextrous (laterality mean = 0.25; SD = 0.21) and two participants were left- 
handed (laterality mean = −0.69; SD = 0.10). The experiment was conducted in 
accordance with the ethical standards laid down in the Declaration of Helsinki and 
fulfilled the ethical standard procedure recommended by the Italian Association of 
Psychology (AIP). All procedures were approved by the Department of Education 
and Human Sciences of the University of Modena and Reggio Emilia where the 
experiment was conducted. All participants gave their written informed consent to 
participate in the study. 

¹ https://www.trainingcognitivo.it/GC/nonparole/. 


4 Apparatus 


Stimulus presentation, response times (RTs) and accuracy were controlled and 
recorded by E-Prime 2 (Psychology Software Tools, Inc., Sharpsburg, PA). Partici- 
pants completed the experiment on an HP ProDesk 490 G1 MT running Windows 7 
with a 19-inch monitor and a display resolution of 1280 × 1024 pixels. 


5 Design and Procedure 


Two factors were manipulated: Target word with three levels (action-relevant; action- 
irrelevant; unrelated), and Spatial compatibility between the orientation of the 
action-relevant component of the prime object and the response, with two levels 
(spatially compatible: both handle and response on the right or on the left; spatially 
incompatible: handle on the right and response on the left and vice versa). Both factors 
were manipulated within subjects. 

Participants sat at a viewing distance of about 60 cm from the monitor in a dimly- 
lit room. Each trial started with the presentation of a fixation cross (0.3 cm x 0.3 cm) 
for 500 ms. Immediately after the fixation, the prime object appeared on screen for 
1000 ms. Then, either the target word or the non-word filler was displayed on screen 
until a response was given or until 1500 ms had elapsed (see Fig. 1 for details). RTs 
were measured from the onset of the target stimulus. Both target and filler 
stimuli were in bold lowercase 18-point Courier New and were presented in black in the center 
of a white background. 

Participants were asked to make a lexical decision, that is, determine whether the 
displayed letter string was an Italian word or not, by pressing one of two lateralized 
buttons as quickly and as accurately as possible. Response keys were the “-” and the 
“z” keys on an Italian QWERTY keyboard. Half of the participants responded by 
pressing the “-” key with their right index finger when the letter string was an Italian 
word, and the “z” key with their left index finger when it was a non-word. The other 
half was assigned to the opposite mapping. 

[Figure: trial timeline showing the fixation cross (+), the prime object, and the target word (handle); timeline labels as printed in the figure: 500 ms, 1000 ms, 1300 ms] 

Fig. 1 Illustration of an action-relevant target word in the spatially compatible condition. In the 
example, instructions required participants to respond with the left index finger to words and with the 
right index finger to non-words. Note that elements are not drawn to scale 

The order of presentation of each prime-target pair was randomized across partic- 
ipants. The experiment consisted of 24 practice trials (different from those used in 
the experiment) and two experimental blocks of 48 trials each, for a total of 120 trials 
per participant. Blocks were separated by a self-paced interval and the experiment 
lasted approximately 10 min. 
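
The structure of one experimental block can be sketched as follows. This is an illustrative reconstruction, not the E-Prime script used in the experiment. It assumes that each prime object was shown in both orientations with each of its three target words and three non-word fillers (as listed in the Appendix), which is one way of arriving at the reported 48 trials per block. It also assumes that spatial compatibility is coded by comparing the handle orientation with the side of the key assigned to "word" responses; the function and field names are hypothetical.

```python
import random
from itertools import product

# Stimuli transcribed from the Appendix: (action-relevant, action-irrelevant,
# unrelated) target words and three non-word fillers per prime object.
STIMULI = {
    "calorifero": (("manopola", "ghisa", "panchina"), ("agraccia", "bucconede", "celimora")),
    "lattina": (("linguetta", "alluminio", "astuccio"), ("conichia", "fangialle", "geglie")),
    "padella": (("manico", "ceramica", "corredo"), ("ghipi", "naseco", "mezecolo")),
    "porta": (("maniglia", "compensato", "ciglia"), ("ommibicio", "rinchite", "sobbeme")),
}
WORD_TYPES = ("action-relevant", "action-irrelevant", "unrelated")


def build_block(word_key_side, rng):
    """Return one shuffled block; word_key_side is the side of the 'word' key."""
    trials = []
    for (prime, (words, nonwords)), orientation in product(STIMULI.items(), ("left", "right")):
        for word_type, word in zip(WORD_TYPES, words):
            compatible = orientation == word_key_side
            trials.append({"prime": prime, "orientation": orientation, "target": word,
                           "type": word_type, "is_word": True,
                           "compatibility": "compatible" if compatible else "incompatible"})
        for nonword in nonwords:
            # Fillers are discarded in the analysis, so no compatibility is coded.
            trials.append({"prime": prime, "orientation": orientation, "target": nonword,
                           "type": "filler", "is_word": False, "compatibility": None})
    rng.shuffle(trials)  # assumed: a fresh random order for every participant
    return trials


block = build_block(word_key_side="right", rng=random.Random(0))
print(len(block))  # 4 primes x 2 orientations x 6 letter strings
```

With this layout, each cell of the 3 × 2 design contains four word trials per block, that is, eight observations per participant across the two experimental blocks.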


6 Results 


Responses to non-word fillers were discarded. Omissions (1%) and outlying RTs (5%), 
that is, RTs more than two standard deviations (SD) from the participant’s mean, were excluded 
from the analysis. 
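
The exclusion rule can be illustrated with a short sketch. The RT values below are hypothetical, and the sketch assumes that the sample standard deviation was computed over all of a participant's valid word trials; the text does not specify either detail.

```python
from statistics import mean, stdev


def trim_outliers(rts_ms, n_sd=2.0):
    """Keep RTs within n_sd standard deviations of this participant's mean."""
    m, sd = mean(rts_ms), stdev(rts_ms)
    return [rt for rt in rts_ms if abs(rt - m) <= n_sd * sd]


# Hypothetical RTs (ms) for one participant; omissions are coded as None.
raw = [612, 587, 640, None, 598, 1420, 605, 633, 579, 621]
answered = [rt for rt in raw if rt is not None]  # drop omissions first
kept = trim_outliers(answered)
print(len(answered) - len(kept))  # number of RTs excluded as outliers
```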

Two repeated measures ANOVAs with Target Word (action-relevant, action- 
irrelevant, unrelated) and Spatial compatibility (compatible, incompatible) as within- 
subject factors were conducted, one for RT latencies and one for percentage errors 
(3.5%). When sphericity was violated, the Huynh-Feldt correction was applied, 
although the original degrees of freedom are reported. 

The results of the ANOVA on the RT latencies did not reveal any significant main 
effect or interaction, all Fs < 1. In contrast, the results of the ANOVA on the percentage 
errors showed a significant main effect of Target Word (F(2, 66) = 3.67, MSe = 
61.15, p = 0.043, ηp² = 0.10), that is, lexical decision responses were more accurate 
for action-relevant target words (1.65%) than for both action-irrelevant (4.22%) and 
unrelated target words (4.59%), t(33) = 2.92, p = 0.006, and t(33) = 2.61, p = 0.01, 
respectively. No other main effect was significant, F < 1. Results are shown in 
Fig. 2. 

[Figure: bar chart with one bar per target word type (action-relevant, action-irrelevant, unrelated)] 

Fig. 2 Mean lexical decision percentage errors as a function of target word (action-relevant; action- 
irrelevant; unrelated): bars indicate standard errors 

Fig. 3 Mean lexical decision percentage errors as a function of target word (action-relevant; action- 
irrelevant; unrelated) and spatial compatibility (compatible; incompatible): bars indicate standard 
errors 

Importantly, there was a marginally significant interaction between Target Word 
and Spatial compatibility (F(2, 66) = 3.42, MSe = 35.68, p = 0.057, ηp² = 0.09). 
Paired comparisons revealed that lexical decision responses for action-relevant target 
words tended to be more accurate in the spatially compatible condition (0.73%) than 
in the spatially incompatible condition (2.57%), t(33) = 1.71, p = 0.09 two tailed. 
In contrast, lexical decision responses for action-irrelevant target words tended to be 
more accurate in the spatially incompatible condition (2.94%) than in the spatially 
compatible condition (5.51%), t(33) = −1.74, p = 0.09 two tailed. Finally, lexical 
decision responses for unrelated target words did not differ in the spatially compatible 
(4.41%) and incompatible (4.77%) conditions. Figure 3 shows the results graphically. 
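
As a consistency check, the effect sizes reported for the error analysis can be recovered from the F ratios and their degrees of freedom. The sketch below uses the standard identity for partial eta squared; the function name is illustrative.

```python
def partial_eta_squared(f_value, df_effect, df_error):
    """Partial eta squared recovered from an F ratio and its degrees of freedom."""
    return (f_value * df_effect) / (f_value * df_effect + df_error)


# F ratios and degrees of freedom as reported in the Results section.
main_effect = partial_eta_squared(3.67, 2, 66)   # Target Word
interaction = partial_eta_squared(3.42, 2, 66)   # Target Word x Spatial compatibility
print(round(main_effect, 2), round(interaction, 2))
```

Both values agree with the reported effect sizes (0.10 and 0.09) after rounding to two decimals.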


7 Discussion 


Although much evidence is available on the influence of semantics on action prepa- 
ration and execution (Gentilucci and Gangitano 1998; Glenberg and Kaschak 2002; 
Glover et al. 2004; Lindemann et al. 2006; Myung et al. 2006; Zwaan et al. 2002), 
the effects of motor control on language processing remain poorly investigated. 
The current study examined whether semantic processing may be influenced by 
the activation of the motor system. If conceptual content is stored close to the 
sensory and motor systems (Warrington and McCarthy 1983, 1987; Warrington and 
Shallice 1984; see also Damasio 1989; Farah and McClelland 1991; Humphreys and 
Forde 2001; McRae and Cree 2002), and if it shares a common neural substrate 
with the sensory and the motor systems (Barsalou 1999, 2008, 2016), then effects of 
language on action and of action on language should both be observed (Meteyard 
and Vigliocco 2008). 

We explored whether presenting images of graspable objects (e.g., “frying pan”) 
as prime stimuli could pre-activate manipulation information about objects, which 
in turn could facilitate a lexical decision task on target words referring to objects’ 
properties relevant for action (e.g., handle). That is, we expected that object observa- 
tion would activate motor knowledge leading to a motor-to-semantic priming effect 
only for target words referring to action-relevant components of objects as only 
the processing of action-relevant words should benefit from the activation of motor 
knowledge. 

In line with our hypothesis, we found that performing a lexical decision on action- 
relevant target words produced more accurate responses than performing the same 
task on action-irrelevant words and on words unrelated to the prime objects. This 
finding suggests that language processing is somewhat facilitated provided that words 
are not only related to the prime object seen before but also relevant for action with that 
object. It is plausible to assume that the prime object’s graspability was able to shift 
participants’ attention to the action-relevant features of the object thus facilitating 
the subsequent lexical decision on words describing those features. 

Furthermore, we found a marginally significant interaction between the type of word 
(relevant-for-action; irrelevant-for-action; unrelated) and spatial compatibility (compatible, 
incompatible). In line with our hypothesis, we observed a tendency toward lower 
percentage errors (i.e., facilitation) when the target word was action-relevant (e.g., 
handle) and there was spatial compatibility between the orientation of the action- 
relevant component of the prime object and the response. Conversely, we observed 
a tendency toward higher percentage errors (i.e., interference) when the target word 
was action-irrelevant (e.g., ceramic) and there was spatial compatibility between the 
orientation of the action-relevant component of the prime object and the response. 
Therefore, motor information activated by observing objects’ orientation may influ- 
ence language processing to the extent that words being processed are relevant for 
action with such objects. This preliminary finding supports the assumption that 
observing a graspable object activates the motor actions associated with its use (Iani 
et al. 2019; Pellicano et al. 2010; Saccone et al. 2016; Scerrati et al. 2019, 2020; 
Tipper et al. 2006; Tucker and Ellis 1998; Vainio et al. 2007). 

Taken together these findings suggest that the activation of motor information 
may affect semantic processing. 

However, the present study has a limitation in that our results only emerged for 
percentage errors (not response latencies). This may be the consequence of the low 
level of verbal processing required by the lexical decision task (LDT). Indeed, the LDT 
may recruit the semantic system only to a small extent (see Scerrati et al. 2017), thus 
failing to show an influence of motor information on language processing that 
is robust enough to affect response latencies (for task-dependent influences of motor informa- 
tion on conceptual processing see De Bellis et al. 2016; see Garcia and Ibáñez 2016 
for review). That is, if the LDT is performed by relying on a simple word associa- 
tion strategy, i.e., without determining the type of association between the property 
word and the concept word (for example, whether the property word refers to a part 
of the concept word as in the concept-property pair frying pan-handle), then the 
underlying conceptual representations may not be retrieved at all, leaving 
motor information unable to exert a robust influence on semantic processing 
(e.g., Solomon and Barsalou 2004). In addition, as highlighted by a recent review by 
Garcia and Ibáñez (2016), the allowed time-lag (2.5 s) between motor and linguistic 
information may have played a role in our study, leading to a weaker influence of 
motor knowledge on language processing. Such a weakened influence may be reflected in 
the absence of a motor-to-semantic priming effect on response latencies. Even 
holding these caveats in mind, our study indicates a possible influence of motor 
control on cognitive functions and strengthens the hypothesis of the proximity of 
language and sensory-motor systems in the human brain (see also Goldstone and 
Barsalou 1998). 

Future studies may extend the investigation of mutual effects of semantic content 
and motor control by introducing other tasks that more explicitly require the construc- 
tion of modality-specific representations (e.g., motor representations). In fact, it is 
plausible that a conceptual, recognition-oriented task may reveal effects of motor 
control on semantic processing more easily than a more implicit task such as the 
lexical decision task. A different task will help identify to what extent the nature 
of the task determines the motor-to-semantic priming effect and rule out other 
possible factors. 
Appendix 

Prime objects         | Target words                                                   | Non-words 
                      | Action-relevant        | Action-irrelevant     | Unrelated         | 
calorifero (radiator) | manopola (knob)        | ghisa (cast iron)     | panchina (bench)  | agraccia, bucconede, celimora 
lattina (can)         | linguetta (tab)        | alluminio (aluminium) | astuccio (case)   | conichia, fangialle, geglie 
padella (frying pan)  | manico (handle)        | ceramica (ceramic)    | corredo (dowry)   | ghipi, naseco, mezecolo 
porta (door)          | maniglia (door-handle) | compensato (plywood)  | ciglia (eyelash)  | ommibicio, rinchite, sobbeme 
1 Introduction 


In recent years, a number of works have converged on the idea that language, 
cognition, and bodily experience must be considered as inextricably intertwined 
areas of research (Gallese and Lakoff 2005; Lakoff and Johnson 1999). A considerable 
number of multidisciplinary studies showed that sensory-motor information influ- 
ences our cognitive structures and thus represents a primary source in the operation of 
meaning construction (amongst others, Martin and Chao 2001; Pulvermüller 2005). 

In this large frame, action verbs play a pivotal role. They are recognized as primary 
tools both in the linguistic encoding of bodily knowledge and in the linguistic repre- 
sentation and modeling of a wide array of highly abstract concepts (Panunzi and 
Vernillo 2019). Action verbs are mainly used to encode very different types of action 
events and bodily schemas. Their semantic extension allows us to refer to a myriad of 
experiences, affordances, bodily movements, and relations between physical objects 
(i.e., primary variation). Moreover, these predicates are pervasively used to encode 
a large and complex array of abstract concepts and figurative meanings (i.e., marked 
variation), for whose labeling they coherently re-use their rich action imagery. The 
class of action verbs represents a case of exceptional interest within the verb lexicon 
category. These verbs are not only among the primary words of children’s vocabulary 
(Tomasello 2003) but they are also among the most common tools in oral commu- 
nication, having an even more significant weight than nouns in spoken language use 
(Gagliardi 2014; Moneglia 2014a, b). Moreover, and differently from other predi- 
cates, action verbs directly anchor, on the level of language, the domain of sensory- 
motor experience to that of highly abstract thought. Therefore, the analysis of these 
predicates’ semantic variation may ease the understanding of how spatial and bodily 
information (spatial vectors, motion patterns, force dynamics) is mapped to make 
new and non-literal meanings emerge. 

This work,¹ whose primary research field is that represented by Cognitive Linguis- 
tics and Semantic Theory, starts from the hypothesis that there exists a sort of hidden 
relation between the two dimensions of use and meaning of a given action verb (i.e., 
primary and marked variation), and that there also exists a sort of correspondence 
between the type of action and the metaphorical concepts which can be expressed by 
means of the same predicate. The main research questions this study has been built 
upon can be spelled out as follows: 


1. What are the relationships between the concrete (i.e., primary variation) and the 
metaphorical (i.e., marked variation) uses of a given action verb? 

2. Which semantic features of the action verb determine the metaphorical potential 
of the verb? And how do we determine which action verb can allow us to access 
which metaphorical concept or figurative meaning? 

3. Finally, how can we explain divergent metaphorical potentials of action verbs 
involved in the encoding of the same type of action events (i.e., locally equivalent 
verbs)? 


It is worth bearing in mind that these questions are not only relevant with respect 
to my research field (i.e., Linguistics), but are closely connected to the three main 
research questions this volume starts from (Bechberger and Liu, this volume): 


a. On the representation level: how can we formally describe and model concepts 
(Färber, Svetashova, and Harth, this volume; Gust and Umbach, this volume)? 
And, more specifically, how do we use characteristics of action concepts to 
formally model more abstract ones? 

b. On the learning level: where do concepts come from and how are they acquired 
(Bechberger and Kühnberger, this volume)? And, in particular, how do we 
transfer sensory-motor information to new and more abstract contexts? 

c. On the application level: how are concepts used in cognitive tasks (Gega, Liu 
and Bechberger, this volume; Scerrati, Iani, and Rubichi, this volume; Schneider 
and Nürnberger, this volume)? And, with respect to the present research, how do 
we apply our action-knowledge in linguistic contexts where no physical action 
is implied? 


To address these questions, in this study I investigate the 
semantic variation of a small group of Italian action verbs (ita., premere, spingere, 
tirare and trascinare; Eng., to press, to push, to pull, and to drag) involved in the 
encoding of the force-dynamics category (Langacker 1987). Although the four verbs 
under analysis belong to the same semantic class (i.e., force), they profile different types 
of action concepts and events. It seems reasonable to believe that the specific image- 
schematic features associated with their action imagery influence the differences in 
their semantic extension and linguistic use. Nevertheless, along the semantic axis 
(i.e., primary and marked variation), there also exist specific points where the uses of 
these verbs tend to converge. For instance, it happens that, in some specific pragmatic 
contexts, the uses of premere converge with those of spingere (e.g., setting rela- 
tionships between objects), or the uses of trascinare converge with those of tirare 
(e.g., the frictional motion of an object along a surface). These verbs show a partial 
convergence (or divergence) not only with respect to their primary variation (i.e., 
when encoding physical concrete meanings), but also with respect to their marked 
variation (i.e., when encoding figurative meanings). For example, there are cases in 
which the verb premere (Eng., to press) and the verb spingere (Eng., to push) refer to 
the same type of metaphorical concept (e.g., PSYCHOLOGICAL FORCES ARE PHYS- 
ICAL FORCES), or cases in which the verb tirare (Eng., to pull) and trascinare (Eng., 
to drag) encode the same type of conceptual metaphor (e.g., CAUSES ARE FORCES 
AFFECTING MOTION). 

¹ This research is partially based on two previous works (Panunzi and Vernillo 2019; Vernillo 2019) 

The present study is based on the idea that a deep analysis of the action imagery 
associated with these predicates can help us to shed new light on their behavior in 
metaphorical contexts. To support this idea, in the following paragraphs, I will 
describe the semantic variation of each of the four verbs, mainly focusing on the 
salient image-schematic structures and the specific action schemas that characterize 
the primary core of the verbs. Additionally, I will explain how these same structures 
and schemas make it possible to link the marked (i.e., largely metaphorical) and 
the primary variation of the verbs (Lakoff 1990, 1993; Turner 1991). In Sect. 2, I 
will present the ontological infrastructure within which my analysis was 
developed. In Sect. 3, I will give a general overview of the theoretical approaches (i.e., 
Conceptual Metaphor Theory, Image Schema Theory, and Embodiment) that mainly 
influenced my approach to the analysis of action verbs. In Sect. 4, I will present 
the collection of data and the methodology I used for these predicates’ annotation. 
Section 5 will describe the primary variation of each of the four verbs, and it will 
be mainly focused on the salient image-schematic structure and the specific action 
schemas that characterize the primary core of the verbs. In Sect. 6, I will illustrate the 
marked variation of the predicates, and I will explain how the same structures and 
schemas highlighted in the primary variation of the four verbs permit the bonding 
of the marked (or largely metaphorical) and the primary variation of the verbs (see 
the Invariance principle: Lakoff 1990, 1993; Turner 1991). In Sect. 7, I will briefly 
discuss the results obtained by comparing primary and marked variation of the four 
predicates in the analysis. First, I will show that the results of the study are consistent 
with the idea that metaphorical extensions of action verbs are constrained by the 
image-schematic structures involved in the core meaning of the verbs. Second, I will 
point out that these same structures are also responsible for the divergences found 
within the metaphorical variation of action verbs pertaining to the same semantic 
class (i.e., force). Finally, in Sect. 8, I will draw some general conclusions about the 
type of study that I proposed. 
Table 1 Visual representation of action concepts in IMAGACT 

Action type                           | Scene         | Standardization                                                                   | Verb                     | LEV (locally equivalent verb) 
Setting relations between two objects | [scene image] | Marta preme il coperchio sulla scatola (Eng., Mary presses the lid onto the box) | Premere (Eng., to press) | Spingere (Eng., to push) 


2 The Semantic Representation of Action Verbs 
in the IMAGACT Ontology 


The semantic characterization of action verbs given in this study owes a great deal to 
the representation of action events and concepts the IMAGACT Ontology was built 
upon. For this reason, the following paragraphs will be devoted to the general description 
of the Ontology (Sect. 2.1), and the definition of the notion of primary (Sect. 2.2) 
and marked variation (Sect. 2.3). 


2.1 The Internal Structure of the IMAGACT Ontology 


IMAGACT is a multimodal and multilingual ontology that depicts action via a visual 
representation system. The choice to represent action concepts by using both proto- 
typical 3D animations and brief videos (Moneglia 2014a, b; Panunzi et al. 2014) 
stemmed from two needs: first, to avoid the vagueness of semantic definitions, and 
second, to have a resource that could disentangle action categorization from a 
specific language representation (Brown 2014). 

IMAGACT includes more than 1000 distinct action scenes that have been 
primarily derived from the annotation of spoken language corpora in English and 
Italian. While in a preliminary phase of the project, Chinese and Spanish data were 
also processed, extensions to (Syrian) Arabic, Danish, German, Hindi, Japanese, 
Polish, Portuguese, and Serbian were made available on the online interface² only 
recently. 

The visual representation of the action concepts is organized as follows: 
each prototypical scene is linked to a single action concept (or action type), each 
action verb is connected to more than one prototypical scene, and each prototypical 
scene is associated with more than one action verb. Some action verbs share a common 
referent (or a subset of action scenes) and are hence called locally equivalent verbs 
(e.g., to push and to press). Table 1 gives a brief schematization. 

Concerning the present analysis, the IMAGACT framework represents an impor- 
tant point of reference for the investigation of action verb semantics. First, the 
Ontology contains a considerable amount of data drawn from 
multiple spoken resources (e.g., IMAGACT and BNC corpora). Second, this resource 
provides a well-structured visual categorization of action concepts and of bodily 
schemas encoded by the general action verbs which are most used in everyday language. 
Third, it makes it possible to anchor the linguistic representation of the highly abstract concepts 
(and of the figurative meanings) encoded by a given action verb to the 
semantic core of the verb. Finally, it eases the interpretation of the variation axes 
of the predicate (i.e., primary and marked variation), since they are considered jointly 
rather than as entirely separate dimensions of the lexical item. The rich semantic 
information included in the database helped to better structure the annotation of the 
metaphorical uses of the action predicates. Moreover, it helped to expand the number 
of details that have been used to show that both metaphorical and physical uses of 
action verbs are not randomly produced, but that they both refer to crucial motor and 
perceptual inputs coming from our cognitive and actual representation of actions. 

² https://www.imagact.it/imagact/query/dictionary.seam. 


2.2 The Primary Variation of Action Verbs 


Within the IMAGACT Ontology, the semantics of action is described as based on 
two main axes of variation: the primary and the marked variation. Importantly, the 
resource keeps the verb occurrences of the two types of variation well distinct. The 
procedure via which metaphorical and phraseological usages are separated from 
those strictly referring to physical actions is made possible through the adoption of 
an operational test à la Wittgenstein (Gagliardi 2014). According to this test, the verb 
uses are judged primary if it is possible to point to a certain (perceptible) event and 
say to someone who does not know the meaning of a given verb that “this action 
and similar events are what we refer to with this verb”; conversely, the occurrences 
that do not instantiate the basic meaning of the verb are tagged as marked. 

Within the ontology, the expression primary variation refers to the set of different 
action types to which a given action verb can refer in its proper sense (or concrete 
physical meaning). To illustrate this point, some of the possible physical uses of the 
Italian action verb spingere (Eng., to push) are considered: 


(1) “Marta spinge il pulsante” 
“Marta pushes the button” 

(2) “Marta spinge il coperchio sulla scatola” 
“Marta pushes the lid onto the box” 

(3) “Marta spinge il carrello lungo il corridoio” 
“Marta pushes the cart down the hall” 

(4) “La nuotatrice si spinge con le gambe” 
“The swimmer pushes herself off of the wall with her legs” 


All the listed examples (1-4) are recognized as instantiations of the primary meaning 
of the verb spingere (Eng., to push). The semantics of the predicate is shown in all 
its complexity while encoding different linguistic and cognitive traits. In examples 
(1-2), the verb can be substituted by the same locally equivalent verb (e.g., premere), 
even though the scenes refer to two action types: in the former case, the verb describes 
the application of force on an object to activate a connected device; in the latter case, 
the verb describes a situation in which a human agent applies a force to set a relation 
between two entities. In examples (3-4), the verb spingere cannot be substituted 
by the same locally equivalent verb: the meaning of case (3) cannot be encoded by 
another predicate. The event in case (4) cannot be named by a single verb but only 
by a more complex syntactic structure, such as ‘darsi una spinta’ (Eng., ‘to give 
yourself a push’). Moreover, the examples in (3—4) describe two types of motion 
event in the physical space. In (3), the verb names an event in which a human agent 
causes an object to move along a path (caused motion). In example (4), the verb 
encodes an event in which a human entity moves spontaneously along a path without 
the intervention of an external force (self-propelled motion). 


2.3 The Representation of the Marked Variation of Action Verbs 


The term marked variation refers to the set of uses in which the action verb does 
not encode physical concepts but abstract or figurative ones (Moneglia et al. 2012; 
Panunzi and Moneglia 2004). Let us consider the following four sentences, which 
partially exemplify the marked variation of the verb spingere (Eng., to push): 


(5) “L'oratore spinge sui temi sociali” 
“The speaker is pressing on the social agenda” 
(6) “La situazione spinge il Consiglio a intervenire” 
“The situation is pushing the Council to intervene” 
(7) “L'autore ama spingere i suoi personaggi” 
“The author likes to push his characters” 
(8) “La situazione si spinge verso l’anno successivo” 
“The situation will extend onto the next year” 


The sentences in (5-8) do not instantiate the basic meaning of the verb spingere 
(Eng., to push). These examples are based on different semantic processes (mostly 
metaphorical) through which the verb undergoes a semantic shift and comes to 
express different kinds of metaphorical meanings. In particular, the predicate 
represents a situation in which a speaker conveys a specific communicative intention 
(5), implies an act of psychological influence (6), defines the artistic manipulation 
put in place by an author (7), or names the extension in time of an event (8). 

As stated above, marked uses are sharply separated from the occurrences referring 
to concrete physical actions and annotated in a different online interface. 
Unfortunately, although an ad hoc infrastructure was designed to classify the marked 
uses found in the variation of the action verbs (Brown 2014), the IMAGACT ontology 
only specifies the semantic interpretations of predicates with respect to their physical 
actions: hence, other kinds of interpretations are ignored and are not visually 
represented. The lack of a clear depiction of marked uses is not connected in any way to 
their semantic load within the infrastructure (they represent half of the IMAGACT 
database occurrences). Rather, this problem must be explained by reference to the 
visual format of the ontology, which makes it difficult to represent abstract concepts 
(Brown 2014). 


3 Body, Metaphors, and Metaphorical Projections of Image Schemas 


The analysis focuses on two essential aspects: first, the semantics of action verbs and, 
second, the particular role played by bodily action information. In the following 
paragraphs, I will give a brief overview of the main theoretical frameworks within which 
my analysis has been developed. Before going through a proper analysis of the semantic 
variation of action verbs, three fundamental frameworks need to be illustrated: in 
Sect. 3.1, I will present the main tenets underpinning the embodied theory of language. 
In Sect. 3.2, I will introduce the key points behind Lakoff and Johnson’s Conceptual 
Metaphor Theory. Finally, in Sect. 3.3, I will focus on Image Schema Theory, 
as well as its role within a deep-level analysis of language. 


3.1 The Embodied Paradigm 


Recently, interest has grown in the idea that language and cognition should be 
investigated with respect to their deep relationship to bodily experience (Aziz-Zadeh and 
Damasio 2008; Desai et al. 2011; Gallese and Lakoff 2005; Kiefer and Pulvermüller 
2011; Martin and Chao 2001; Pulvermüller 2005). Embodied cognition theories are 
based on the assumption that between the level of cognitive processes (action and 
perception) and abilities (abstract thought and language comprehension) there is no 
defined boundary or sharp separation (Zipoli Caiani 2011). Accepting that not only 
the brain but also features of the agent’s body play a significant role in cognitive 
processing means embracing the idea that our entire conceptual system is largely 
constrained by the kind of body and sensory-motor processes that characterize us 
as humans. The body emerges as a crucial locus and represents a functional constraint 
that imposes its structure on different domains of human experience (Zipoli Caiani 
2011). But what does it mean to embrace the embodied paradigm when it comes 
to language? The embodied approaches emerged in response to the Cartesian (or 
cognitivist) paradigm. According to this paradigm, the brain is viewed as a processor 
of abstract information, while cognition should be defined as the computation of 
the abstract symbols that language is made of (Varela 1991). 

By contrast, embodied theories (Barsalou 2008, 2016; Johnson 1987; Lakoff 1987; 
Lakoff and Johnson 1980, 1999; Wilson 2002) argue that reasoning, concepts, and 
language are grounded in experience and tightly bound to the body and its specific 
features. In this framework, it is claimed that the body, with its inherent way of 
functioning and interacting in physical space, directly impinges on our cognitive 
structures and thus represents one of the primary sources in the operation of 
meaning construction (Lakoff and Johnson 1999). A considerable number of empirical 
studies have indeed shown that conceptual knowledge is deeply rooted in perceptual and 
motor systems (Gallese and Lakoff 2005; Martin and Chao 2001; Pulvermüller 2005). 
Additionally, it has been shown that sensory-motor simulations directly impinge on the 
processing and understanding of language (Glenberg and Kaschak 2002; Tettamanti 
2005). 

The adoption of an embodied approach to the study of lexicon relies on the idea 
that bodily properties have a crucial function in meaning construction processes. 
Embodied theories, in fact, directly look at body and language as a tight coupling, in 
which the comprehension of the latter cannot take place without information deriving 
from the former (Gibbs and Colston 1995; Gibbs 2005). As I will show, bodily 
features, sensory inputs, and action-oriented schemas do also play a pivotal role in 
the construction and extension of the action verbs’ meaning, both on the concrete 
and the abstract representation level (Panunzi and Vernillo 2019). This is why, in 
this paper, not only physical but also figurative meanings of action verbs have been 
accounted for by working on the idea that sensory-motor processes can provide us 
with more data on human understanding and representation of concrete and abstract 
concepts. The starting point of the analysis will be that the different semantic layers 
(i.e., primary and marked variation) characterizing the semantic core of action verbs 
should not be viewed as separate dimensions of the lexical meaning but, rather, as 
deeply and strongly connected. 


3.2 Conceptual Metaphor Theory 


Conceptual Metaphor Theory (henceforth CMT: Lakoff and Johnson 1980, 1999) 
represents one of the most powerful theories of abstract reasoning. Over the years, 
CMT has benefited from a considerable number of empirical studies which have 
confirmed, to some extent, the reliability of the approach (Casasanto and Bottini 2014; 
Gibbs 2006; Jamrozik et al. 2016). One of the essential claims of CMT is that 
metaphors concern not only the way we use language but also the way we organize 
human thought. In this theoretical scenario, metaphors are not conceived as 
mere rhetorical tropes but rather as cognitive processes, by means of which aspects 
of human cognition, perception, and experience are transposed into language (Lakoff 
and Johnson 1980). CMT can be considered the most embodied approach to the 
study of language. It is in fact based on the idea that the way we refer to abstract 
concepts exploits the rich flow of information which we gain from our experience of 
the world and of the way we bodily interact with the world and the objects therein. 
According to Lakoff and Johnson (1980: 115), a large number of concepts that are 
meaningful to us are either abstract or not well delineated in our experience. They 
thus need to be conceived via concrete concepts that we can understand in 
clearer terms. The internal structure of many abstract concepts, such as Changes, 
States, or Causes, appears to be cognitively grounded in the metaphorical mapping of 
more concrete schemas such as FORCE and MOTION (Gibbs 2006; Lakoff and Johnson 
1980, 1999). People talk about state changes in the same way they talk about motion 
changes (e.g., CHANGE OF STATE IS CHANGE OF MOTION), causes in the same way 
as forces (e.g., CAUSES ARE FORCES), or states as physical locations in space 
(e.g., STATES ARE LOCATIONS). 

Metaphors are based on a conceptual mapping operation that transfers preconceptual 
knowledge from one concrete source domain to an abstract target domain 
(Lakoff and Johnson 1980). The information transfer must respect some basic rules 
and is supposed to be constrained by a number of different factors that can enable or 
stop the metaphorization process (Rudzka-Ostyn 1995). Among other things, it 
is worth noting that the mapping is not an exhaustive process, that is, not all but 
just some aspects of the source domain are transferred onto the target domain (see 
the partial metaphorical utilization phenomenon in Kövecses 2010). The mapping is 
conditioned by an asymmetrical directionality, according to which the transfer may 
only go from the source to the target domain and not vice versa (Lakoff and Johnson 
1980). Moreover, the mapping operation must not violate the internal structure of 
the target domain (i.e., target domain override). According to the Invariance 
Principle Hypothesis (Lakoff and Turner 1989; Lakoff 1990, 1993; Turner 1991), the 
metaphorical mapping must preserve the cognitive topology (or image-schematic 
structure) of the source domain consistently with the inherent structure of the target 
domain. 

As the present work is concerned with the analysis of the semantic variation 
of action verbs, both on the concrete and the abstract level of representation, an 
approach to the study of language such as that proposed by CMT can help: (a) to 
better disclose the nature of the relationship that ties together the primary 
and metaphorical uses of a given action verb; and (b) to investigate the specific 
role that bodily-action information plays within our conceptual system (Panunzi and 
Vernillo 2019). 


3.3 Image Schema Theory 


Image schemas (or schemata) are a key notion in the field of Cognitive Linguistics, 
used to tie together embodied experience, cognition, and language. The early 
notion of the concept dates to the empirical works on spatial relations terms by Talmy 
(1983) and Langacker (1987), but it was fully developed only a decade later by 
Johnson (1987) and Lakoff (1987). Image schemas have been investigated not only in 
Cognitive Linguistics but in many research fields, among others Psycholinguistics 
(Gibbs and Colston 1995), Developmental Studies (Mandler 1992; Mandler and 
Cánovas 2014), Poetics (Lakoff and Turner 1989), and Neuroscience (Feldman and 
Narayanan 2004; Gallese and Lakoff 2005). 

Image schemas are deemed to be imaginative structures of understanding; by their 
means, we can make sense of our everyday bodily functioning and physical interaction 
within the surrounding space. They directly emerge from bodily experience and 
represent a sort of bridge between sensory-motor information and higher cognitive 
functions (Hampe 2005). According to Johnson’s (1987: XIV) traditional definition, 
an image schema is a ‘recurring, dynamic pattern of our perceptual interactions and 
motor programs that gives coherence and structure to our experience’. In the 
literature, the umbrella term image schema has been subject to different interpretations 
and has thus resulted in a large cross-linguistic variation in the use of the term itself 
(Mandler and Cánovas 2014; Talmy 1983). Although there is no general agreement 
upon the definition of the concept, there is broad consensus that image schemas 
are characterized by a stable set of recurrent properties (Cienki 1997; Gibbs 2006; 
Hampe 2005; Hampe and Grady 2005; Johnson 1987; Krzeszowski 1993; Lakoff 
1987): 


a. They recur across a large variety of distinct experiences and are not bound to a 
particular context of experience and knowledge; 

b. They are preconceptual primary components, that is, unlike propositions, they 
do not state the truth or other conditions of satisfaction; 

c. They are characterized by an internal gestalt configuration (they 
contain a small number of related parts and are intended as coherent and meaningful 
wholes); 

d. They tend to be co-experienced (e.g., superimposition); 

e. They show an orientation towards a positive or negative default evaluation 
when used in metaphorical mappings (plus-minus or axiological parameter); 

f. They have both a static and a dynamic nature (they can represent either a state of 
being or processes); 

g. Their internal structure supports inferences; and 

h. They operate beneath the level of our conscious awareness. 


A condensed inventory of the image-schematic structures which most frequently 
recur in our experience is provided in Johnson (1987). The list is not conceived as a 
closed set but rather as the result of an informal analysis (or reflective interrogation) 
of the most basic phenomenological features of our everyday experience. Different 
approaches to the identification and categorization of image-schematic structures 
have been proposed by Mandler (1992), Talmy (2000), and Mandler and Cánovas (2014). 
Beyond the differences between the various taxonomic proposals, some of the most 
frequently cited examples of image schemas are CONTAINMENT, SOURCE-PATH-GOAL, 
VERTICAL AXIS, FORCE, and SUPPORT. 

Image schemas are operative in our perceptual interactions, bodily movements, 
and physical manipulation of objects since early infancy (Mandler 1992; Mandler 
and Cánovas 2014). They are recognized as primitive cognitive components in the 
development of human thought. These conceptual building blocks not only encode 
spatial⁷ and bodily related information but also play an essential role in the modeling 
of highly abstract concepts (e.g., over in Lakoff 1987 and Brugman 1988; verticality 
in Ekberg 1995; straight in Cienki 1998; smooth-rough in Rohrer 2006). 
Skeletal projections of image schemas are transferred from domain to domain through 
analogical reasoning and metaphorical mapping (Kövecses 2010). In the operation 
of metaphorical mapping, image schemas constrain the information transfer in such 
a way as to prevent the source domain topology from being incoherent or inconsistent 
with the internal structure of the target domain (Invariance Principle; Lakoff 1990, 
1993; Turner 1991). 

Since this linguistic investigation rests on the basic idea that physical experiences 
can be thought of as one of the most important sources that give meaning to conceptual 
structures, my analysis strongly benefited from the adoption of an image-schematic 
approach to the study of language. As I will show in the next sections, the differential 
image-schematic structures characterizing the semantic core of action verbs 
directly impinge on their extension and, consequently, their metaphorical potential. 
They determine the type of abstract concepts (and figurative meanings) that may or 
may not be conveyed by the action predicates. Against this background, the detection 
of the image schemas operating within the primary variation of action verbs 
helped on two levels of the analysis: first, to motivate 
the synonymy relations between two action verbs (e.g., both spingere and 
premere may be used to express the same action concept); and second, to understand 
the divergent or convergent behaviors that two action verbs have when used to encode 
abstract concepts and figurative meanings (e.g., the verbs spingere and premere are 
not always used to convey the same kind of metaphorical concepts). 


4 Data and Methods 


This study aims at investigating the semantic variation of a cohesive group of four 
action verbs that, in their basic meaning, codify the exertion of physical force: 
premere (Eng., to press), spingere (Eng., to push), tirare (Eng., to pull) and trascinare 
(Eng., to drag). The data the analysis was built upon have been primarily extracted 
from the corpus IMAGACT (Moneglia 2014a, b) and later integrated with a larger 
number of occurrences taken from the Opus corpus (Italian subtitles). The annotation 
process started with the scrutiny of more than 5000 occurrences, of which about 1000 
were derived from IMAGACT and around 4000 from the Opus corpus. Interestingly, 
just a small part of the whole collection became the classification core. This means 
that the analysis of the metaphorical production (i.e., marked variation) of the four 
action verbs was only based on 300 metaphorical occurrences. 


7 According to Mandler (1992) and Mandler and Cánovas (2014), spatial inputs are recoded in the form 
of image schemas during processes of perceptual meaning analysis and used as primitive conceptual 
components in the development of human thought. 

The deep annotation process can be spelled out in the following three crucial steps: 


1. Overall evaluation of the primary variation of each action verb, with the extraction 
of the significant semantic properties with a strong distinctive value (e.g., 
differences in motor schemas, spatial relations, type of object involved, action 
participants); 

2. Examination of the metaphorical uses of each action verb and isolation of the 
metaphorical conceptual structures found within their marked variation (Lakoff 
et al. 1991; MetaNet: Dodge et al. 2013); 

3. Identification of the most salient image-schematic components (Johnson 1987; 
Lakoff 1987; Mandler and Cánovas 2014) for each verb, with respect to its 
primary and marked variation. 
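
The outcome of steps 2 and 3 can be pictured as one record per annotated occurrence. The following Python sketch is purely illustrative: the class AnnotatedUse and its field names are hypothetical and do not reproduce the IMAGACT annotation format, and step 1 (the distinctive properties of the primary variation) is not modeled. The example values are taken from sentence (12) and from the metaphor and schemas discussed for it in Sect. 6.1.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class AnnotatedUse:
    """One annotated occurrence of an action verb (hypothetical record layout)."""
    verb: str
    sentence: str
    variation: str                      # "primary" or "marked"
    metaphor: Optional[str] = None      # step 2: conceptual metaphor, marked uses only
    image_schemas: List[str] = field(default_factory=list)  # step 3: salient schemas

    def __post_init__(self):
        if self.variation not in ("primary", "marked"):
            raise ValueError("variation must be 'primary' or 'marked'")
        # Assumption of this sketch: a primary (physical) use is not assigned
        # a conceptual metaphor.
        if self.variation == "primary" and self.metaphor is not None:
            raise ValueError("primary uses carry no conceptual metaphor")


marked_use = AnnotatedUse(
    verb="spingere",
    sentence="Abbiamo spinto affinché tale diritto sia reso più accessibile",
    variation="marked",
    metaphor="PSYCHOLOGICAL FORCES ARE PHYSICAL FORCES",
    image_schemas=["ANIMATE ENTITY", "OBJECT"],
)
print(marked_use.verb, "->", marked_use.metaphor)
```

Keeping the metaphor and the image schemas in separate fields mirrors the distinction between steps 2 and 3, so that the same record layout can hold both primary and marked uses.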


5 Description of the Primary Variation of the Four Action Verbs 


The four general action verbs premere (Eng., to press), spingere (Eng., to push), 
tirare (Eng., to pull), and trascinare (Eng., to drag) can be looked at as a cohesive 
semantic class, in which the category of force dynamics plays the main role. 
They are, in fact, all used to express the exertion of some kind of physical force on 
the agent’s body, an animate theme, or a tangible object. To simplify the representation 
of their semantic variations and the isolation of the common and differential traits, 
the presentation has been organized by coupling these verbs in two sub-groups: (1) one 
group represented by premere and spingere; (2) the other group represented by tirare 
and trascinare. 

In Sect. 5.1, I will describe the primary variation of premere and spingere, 
highlighting convergent and divergent points along their axis of variation. In Sect. 5.2, 
I will focus on the description of tirare and trascinare, and I will try to illustrate 
their semantic similarities and differences when their physical (and concrete) uses are 
considered. 


5.1 The Primary Variation of the Verbs Premere and Spingere 


As locally equivalent verbs, premere (Eng., to press) and spingere (Eng., to push) 
share a common sub-set of action concepts. They are applied in a small range of 
linguistic contexts to encode action events in which an agent interacts with an entity 
by exerting force on it. Interestingly, the entity is not deeply or permanently physically 
affected by the force and, overall, is not moved from one place to another. Both 
verbs, for instance, are employed in the depiction of action events in which the 
force can result in: (a) the activation of a device connected to the affected entity 
(“Spingere/premere il pulsante”; Eng., “To push/To press the button”); and (b) the 
establishment of new relations between two or more entities (“Spingere/premere il 
coperchio sulla scatola”; Eng., “To push/To press the lid onto the box”). 


5.1.1 The Primary Variation of the Verb Premere 


The equivalence of the verbs premere (Eng., to press) and spingere (Eng., to push) 
is not absolute and their variations do not tend to systematically converge. Besides 
the uses presented above, the verb premere also appears to codify action concepts in 
which the application of force on a specific entity (in the form of physical pressure) 
results in a mere physical manipulation (e.g., “Il fisioterapista preme sulla schiena 
di Maria”; Eng., “The physical therapist presses on Mary’s back”). With regard to 
its inherent image-schematic structure, the verb premere is based on the FORCE schema 
and, unlike spingere, never entails the MOTION schema. The verb premere is mainly 
used to profile static scenarios, that is, to highlight the mere interaction between a 
force and the entity affected by the exertion of the force. Given the prototypical action 
imagery associated with premere, the image-schematic components which appear to 
play a relevant role in its primary variation are: COMPULSION FORCE, CONTACT, 
OBJECT, and BLOCKAGE. 


5.1.2 The Primary Variation of the Verb Spingere 


The verb spingere (Eng., to push) commonly expresses action events in which the 
exertion of force on a concrete entity has motion as a direct entailment. The motion 
can either be brought about by an external force (e.g., CAUSED MOTION: “Spingere il 
carrello”; Eng., “To push the cart down the hall”) or be spontaneous and not brought 
about by another force (e.g., SELF-PROPELLED MOTION: “Il nuotatore si spinge con 
le gambe”; Eng., “The swimmer pushes himself off of the wall”). Moreover, motion 
can be continuous and controlled by the agent along the overall path (e.g., CAUSED 
JOINT MOTION schema); or it can be discrete and controlled by the agent only in 
the initial phase of the event (e.g., CAUSED MOTION schema). The former MOTION 
schema plays a central role in the construal of those action events in which the agent 
has control of the theme throughout the motion (e.g., “Spingere il carrello”; Eng., 
“To push the cart down the hall”). The latter schema is decisive in those action 
events in which the agent does not experience the overall motion of the theme, and 
in which the motion results in a different spatial agent-theme configuration, such as 
an increase in the physical distance between the agent and the entity affected by 
the force (e.g., “Spingere la scatola”; Eng., “To push the box away”). As the verb 
structure suggests, the tight association between the FORCE and the MOTION schemas 
is a distinctive feature of the semantic core of spingere. Rather than being used to 
encode events of mere force exertion, the verb spingere is mainly used in the encoding 
of kinetic events, that is, events involving a shift in the location of the affected 
entity (animate or inanimate). As the prototypical action imagery associated with 
spingere suggests, the image-schematic components which play a relevant role 
in the primary variation of the verb are: COMPULSION FORCE, CONTACT, OBJECT, PATH, 
and SELF/CAUSED MOTION. 

To give a general overview of the image schemas discussed so far, in the 
table below I present a brief summary of the different components involved in the 
semantic core of the action verbs premere (Eng., to press) and spingere (Eng., to 
push), distinguishing between salient (+), absent (−), and optional (+/−) schemas 
(Table 2). 

Table 2 Differential image schemas in the variation of premere and spingere 

Differential image schemas 
            Contact   Compulsion force   Motion   Path   Object 
Premere     +         +                  −        −      + 
Spingere    +         +                  +        +      +/− 
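
The comparison summarized in Table 2 can also be expressed as a small feature matrix. The Python sketch below is an illustration only and is not part of the annotation infrastructure: the names TABLE_2 and compare are hypothetical, the minus sign is written as "-", and the Path value for spingere is read as salient (+) on the basis of the list of components given above for that verb.

```python
# Values as in Table 2: "+" salient, "-" absent, "+/-" optional.
# Assumption: PATH is salient for spingere, following the running text.
TABLE_2 = {
    "premere": {
        "CONTACT": "+", "COMPULSION FORCE": "+",
        "MOTION": "-", "PATH": "-", "OBJECT": "+",
    },
    "spingere": {
        "CONTACT": "+", "COMPULSION FORCE": "+",
        "MOTION": "+", "PATH": "+", "OBJECT": "+/-",
    },
}


def compare(verb_a, verb_b, table=TABLE_2):
    """Split the schemas into those with identical values and those that differ."""
    a, b = table[verb_a], table[verb_b]
    shared = {schema: a[schema] for schema in a if a[schema] == b[schema]}
    differential = {
        schema: (a[schema], b[schema]) for schema in a if a[schema] != b[schema]
    }
    return shared, differential


shared, differential = compare("premere", "spingere")
print("shared:", shared)
print("differential:", differential)
```

Run as is, the sketch groups CONTACT and COMPULSION FORCE as shared, and MOTION, PATH, and OBJECT as differential, which matches the contrast between the static profile of premere and the kinetic profile of spingere described above.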


5.2 The Primary Variation of Tirare and Trascinare 


When we use tirare (Eng., to pull) and trascinare (Eng., to drag) as locally equivalent 
verbs, we typically refer to action events in which an agent exerts a physical 
force (COMPULSION FORCE schema) on a theme (either animate or inanimate), 
so as to forcefully and roughly move it along a surface (CAUSED JOINT MOTION 
schema).⁴ The force can be either directly applied to the affected entity (e.g., “Fabio 
tira/trascina il sacco della spazzatura”; Eng., “Fabio pulls/drags the garbage bag”) or be 
indirectly applied using an intermediary instrument (“Giovanni tira/trascina la barca 
con l’argano”; Eng., “John pulls/drags the boat onto the beach with the winch”). The 
transfer of the object (i.e., the theme) over the terrain does not happen smoothly, but 
encounters some difficulties which slow down the motion of both entities 
involved (i.e., the agent and the theme). The slowing down may be caused either by 
the fact that the theme has a weight that impedes its motion or by the theme’s 
reluctance to move along the path (BLOCKAGE schema). Either way, the verbs tirare and 
trascinare profile an action scene in which, at each step of the motion, the agent tries 
to forcefully overcome the resistance produced by the friction between the theme 
and the path along which it moves (RESTRAINT REMOVAL schema). 


5.2.1 The Primary Variation of the Verb Tirare 


The verbs tirare (Eng., to pull) and trascinare (Eng., to drag) are tied in a 
relationship of partial synonymy, that is, they are not always applicable in the same 
linguistic contexts. The semantics of the verb tirare is based on a larger array of 
action events and schemas. In general, the predicate describes action scenes in which 
the force applied may or may not result in events of proper motion. In cases where it 
does, the predicate describes events in which an agent causes an object to move along 
a path (e.g., CAUSED MOTION⁵). The motion can be performed either along the vertical or 
the horizontal axis, and it is normally supposed to be directed towards the agent or 
towards the effector that applies the force. In cases in which the exertion of force 
does not result in a schema of motion, the predicate is used to profile action events 
involving the mere manipulation or modification of the shape of an object (e.g., 
“Mario tira la corda”; Eng., “Mario pulls the rope”). Given the prototypical action 
imagery associated with the verb tirare, the following image-schematic components 
were isolated: COMPULSION FORCE, OBJECT, CONTACT, PATH, and CAUSED/CAUSED 
JOINT MOTION. 


4 The agent has control of the theme throughout the motion and not only in the beginning phase of 
the force-action event. 

Table 3 Differential image schemas in the variation of tirare and trascinare 

Differential image schemas 
Contact | Compulsion force | Restraint removal | Blockage | Motion | Path | Object | Surface 
Tirare +/− H= +/− H= | + +/− 
Trascinare + + H H + H 


5.2.2 The Primary Variation of the Verb Trascinare 


The verb trascinare (Eng., to drag) has a narrower primary variation than 
the verb tirare, as it is only used to encode action events in which the motion is 
performed in the same direction as the agent or effector (CAUSED JOINT MOTION schema). 
The verb trascinare can also be used to name physical events of SELF-PROPELLED 
MOTION, that is, events in which an animate entity moves along a path spontaneously, 
without the intervention of an external force (e.g., “Fabio si trascina lungo il 
corridoio”; Eng., “Fabio drags himself down the hall”). In both cases (CAUSED 
and SELF-PROPELLED MOTION schemas), the predicate encodes action events in 
which the existence of a frictional force influences the specific manner of motion 
along the path (the motion is performed forcefully and roughly). As the analysis 
of the action imagery associated with the verb trascinare suggests, the following 
image schemas are relevant within its semantic core: COMPULSION FORCE, OBJECT, 
CONTACT, PATH, SELF/CAUSED JOINT MOTION, SURFACE, BLOCKAGE, and RESTRAINT 
REMOVAL. 

Table 3 proposes a set of differential image-schematic components 
that allow a better understanding of the application conditions of tirare (Eng., to 
pull) and trascinare (Eng., to drag), distinguishing between salient (+), absent (−), 
and optional (+/−) schemas. 


5 Unlike trascinare, the verb tirare does not encode the image schema SELF-PROPELLED MOTION. 


6 Description of the Marked Variation of the Four Action Verbs 


In the previous sections, it has been claimed that the semantics of general action verbs 
is strongly tied to specific perceptual, spatial, and motor schemas. It has been shown 
that the semantic variation of two similar action verbs (e.g., premere and spingere; 
tirare and trascinare) can partially converge and be responsible for their 
interchangeable use in the operation of action reference and labeling. However, it has 
also been pointed out that these same verbs can be applied in diverse pragmatic 
contexts to express diverse types of action events. The question I want to investigate 
is whether these couplings extend the same kind of interwoven semantic relations to 
their marked variations. Their pervasiveness, though, does not manifest itself only on 
the level of reference to concrete actions, but also on a more abstract one, where the 
semantic core is exploited to encode figurative meanings (i.e., marked variation), 
springing from largely metaphorical processes. 

In the following sections, it will be shown how different semantic properties of 
the predicates connect to different types of metaphors and metaphorical meanings. 
In particular, in Sect. 6.1, the most significant types of metaphors detected within 
the marked variation of premere (Eng., to press) and spingere (Eng., to push) will be 
analyzed and compared. Finally, in Sect. 6.2, the metaphorical uses of tirare (Eng., 
to pull) and trascinare (Eng., to drag) will be spelled out. The analysis will not 
only consider the conceptual metaphorical structures needed to explain the array of 
abstract uses identified in the verbs’ semantics, but it will also identify the image 
schemas that are salient in the operation of metaphorical meaning construction. 


6.1 The Marked Variation of the Verbs Premere and Spingere 


The verbs premere (Eng., to press) and spingere (Eng., to push) are often used 
co-extensively to express the same kind of metaphorical concepts. Both verbs are 
involved in the encoding of the general conceptual metaphor PSYCHOLOGICAL FORCES 
ARE PHYSICAL FORCES, via which psychological manipulation (e.g., impact or 
influence) is understood in terms of physical manipulation (e.g., contact or 
pressure): 


(9) “L'oratore preme sui temi sociali” 
“The speaker is pressing on the social agenda” 
(10) “Occorre premere sulle due parti perché il negoziato sia vero” 
“We need to put pressure on the parties to make the agreement true” 
(11) “Bisogna spingere sui processi di liberalizzazione” 
“We need to put pressure on the deregulation processes” 
(12) “Abbiamo spinto affinché tale diritto sia reso più accessibile” 
“We pushed to make this right more accessible” 


The verbal items in (9-12) exploit our knowledge of the category of force dynamics 
in the representation of the psychological interaction between two entities: the source 
of the force (ANIMATE ENTITY schema) and the party affected by the force (OBJECT 
schema). The sentences (9-12) represent, on the level of language, the projection of 
the concrete domain of physical forces (e.g., pressure) onto the abstract domain 
of psychological forces (e.g., influence). 


6.1.1 The Marked Variation of the Verb Premere 


Unlike spingere (Eng., to push), the verb premere (Eng., to press) is often used to 
describe a situation in which the entity exerting the force is perceived as a burdensome 
object (OBJECT schema), weighing on another entity or theme (OBJECT and SUPPORT 
schemas) through a sort of imagined contact: 


(13) “La disoccupazione preme sulla spesa sociale” 
“Unemployment weighs on public expenditure” 


Example (13) is a linguistic variation of the metaphor IMPEDIMENTS TO IMPROVING 
ECONOMIC STATUS IS PHYSICAL BURDEN, which represents a complex case of the 
primary metaphorical structure DIFFICULTIES ARE IMPEDIMENTS TO MOVEMENT. 
The sentence frames a very specific scene in which unemployment (OBJECT schema) 
is conceived as a social burden or as an obstacle (BLOCKAGE schema) that weighs 
on (COMPULSION FORCE schema) public spending. More generally, the verb 
premere appears to be pervasively used to express metaphors that exploit our 
experience of and response to burdens and loads to structure more abstract 
domains. Similarly, when I say “Il tempo preme” (Eng., “Time is 
pressuring me”), I am not referring to the fact that I may eventually change the 
situation I am in because of the time pressure. I am focusing on the fact that 
another entity (e.g., time) is exerting a psychological force (conceived in terms of 
pressure), that this entity is affecting my state of mind, and that I may be weighed 
down by the force itself. In such cases, the direct contact between the source and 
the target entity results in a sort of burdensome stasis or mere physical pressure, 
without implying a change of state or action of the target entity. This can be 
connected to the fact that, as noted in Sect. 4.1.1, the action imagery associated 
with the verb premere does not entail the image schema of MOTION. As a consequence, 
this action verb is mainly used to represent static scenarios, that is, to express the 
mere interaction between a force and the entity affected by the force. 


6.1.2 The Marked Variation of the Verb Spingere 


The verb spingere (Eng., to push), by contrast, appears in contexts where the encoding 
of more dynamic metaphorical concepts is based on the source domain of MOTION: 

(14) “Le circostanze spingono Fabio ad agire” 
“Circumstances are pushing Fabio to act” 
(15) “La situazione si spinge verso l’anno successivo” 
“The situation will press on into the next year” 
(16) “L’amministratore spinge avanti l’azienda” 
“The manager pushes the company forward” 


The metaphorical extensions presented above (14-16) conceptualize causation in 
terms of motion (either caused or self-initiated). In example (14), external forces 
(e.g., circumstances) are construed as animate (ANIMATE ENTITY schema) 
and cause (COMPULSION FORCE schema) a second, target entity (e.g., Fabio: 
ANIMATE ENTITY schema) to perform an action or adopt a set of actions and, eventually, 
behaviors (CAUSED MOTION schema). Importantly, this example is based on the 
generalization that caused change of action is conceived as forced motion relative 
to a location. The expression in (14) can be seen as the linguistic reflection of the 
complex conceptual metaphor CAUSED CHANGE OF ACTION IS CONTROL OVER 
AN ENTITY RELATIVE TO A LOCATION, which is an entailment of the metaphor 
CHANGE OF ACTION IS CHANGE OF MOTION. This conceptual structure also makes 
use of the metaphors CAUSES ARE FORCES and CAUSATION IS OBJECT TRANSFER. 
In example (15), an event is seen as a moving entity (ANIMATE ENTITY schema) 
directed from one location in space (SOURCE POINT FOCUS schema) to another 
(END POINT FOCUS schema). The change that the event undergoes is understood 
as self-initiated motion (SELF-PROPELLED MOTION schema). Example (15) is 
a linguistic variant of the metaphor THE PROGRESS OF EXTERNAL EVENT IS A 
FORWARD MOTION, but may also be understood within a more general metaphorical 
scenario in which TIME is conceptualized as a LANDSCAPE WE MOVE THROUGH and 
ACTION is conceived as SELF-PROPELLED MOTION.⁷ Finally, example (16) can be 
connected to the conceptual metaphor CONTROL OVER ACTION IS CONTROL OVER 
MOTION, which is a special subcase of the conceptual metaphor PURPOSEFUL ACTION 
IS DIRECTED MOTION TO A DESTINATION (CAUSED JOINT MOTION schema). This 
metaphor also entails the metaphorical structure PROGRESS IS FORWARD MOTION 
ALONG THE PATH. In this example and in example (14), causation is understood in 
terms of forced motion relative to a region or a path. The main difference is that 
the metaphorical extension in (16) is based on action imagery slightly different from 
that found in (14). In (16), the verb spingere encodes not only 
forced motion (CAUSED MOTION schema) but also the idea that the forced motion is 
controlled along the overall path (CAUSED JOINT MOTION schema). An animate and 
forceful entity (e.g., the manager) may have a specific purpose (e.g., the development 
of the company) and may want to guide the target entity (OBJECT schema) that she 
controls (e.g., the company) toward the final goal of the long-term, purposeful action 
she is bringing about (END POINT FOCUS schema). 


6 This metaphor is an entailment of PROGRESS IS FORWARD MOTION ALONG THE PATH. 

7 It is a subcase of the metaphor ACTION IS MOTION ALONG THE PATH. 

The combination of the FORCE and MOTION schemas is also salient in the encoding 
of orientational metaphorical extensions by the verb spingere. This holds for those 
uses in which the predicate expresses the change of a certain value along a measurable 
scale: 


(17) “Tali fattori hanno spinto verso l’alto i prezzi” 
“These factors pushed up the prices” 
(18) “I rincari hanno spinto l’inflazione verso valori superiori al 2%” 
“Price increases pushed inflation above 2%” 


Both cases (17-18) can be linked to the metaphor CAUSE INCREASE IN QUANTITY IS 
CAUSE UPWARD MOTION, an entailment of the more general primary metaphor MORE 
IS UP and of the metaphor CAUSED CHANGE OF STATE IS CAUSED CHANGE OF 
LOCATION. The metaphorical mapping is built upon image-schematic knowledge: 
while the target domain (e.g., QUANTITY) makes use of the SCALE schema, the source 
domain (e.g., CAUSED UPWARD MOTION) makes use of the combination of the image 
schemas of COMPULSION FORCE, CAUSED MOTION and VERTICAL ORIENTATION. 

Taken together, the examples discussed above (14-18) raise two points that are 
especially interesting for my analysis: first, the category of force systematically 
intersects with that of motion; and second, unlike premere, the verb spingere encodes 
this constant semantic combination across both its primary and its metaphorical 
variation. 


6.2 The Marked Variation of the Verbs Tirare and Trascinare 


The metaphorical variations of the verbs tirare (Eng., to pull) and trascinare (Eng., to 
drag) usually converge on the encoding of those conceptual metaphors that construe 
the domain of CAUSATION on the basis of the domains of FORCE and MOTION. 
The two predicates are involved in the linguistic representation of a large system 
of metaphors in which causation is connected to animacy (e.g., CAUSATION IS 
AGENTIVE CAUSATION), causes are understood in terms of force (e.g., CAUSES ARE 
FORCES), and changes of state (or of action) are conceptualized as changes of motion 
(e.g., CAUSATION IS CONTROL OVER AN ENTITY RELATIVE TO A LOCATION): 


(19) “Marco tira Luca nella conversazione” 
“Marco involves Luca in the conversation” 
(20) “Il governo non tirerà l’Algeria fuori dal solco in cui si trova” 
“The government will not get Algeria out of its current situation” 
(21) “Ci hai trascinato in mezzo ai guai” 
“You dragged us into a lot of trouble” 
(22) “Il presidente ha trascinato il paese sul fondo” 
“The president dragged the country down” 

In examples (19-22), the verbs tirare and trascinare are used to depict metaphorical 
scenes in which the change of state of the affected entity is caused by an external 
entity (ANIMATE ENTITY schema). The agent has control over the whole process of 
transition from one state to another (PATH schema), and brings about (COMPULSION 
FORCE schema) the final state or goal achieved by the affected entity, which is 
understood in terms of motion from one location to another (CAUSED JOINT MOTION 
schema).⁸ As the analysis of the examples shows, there is an evident correspondence 
between the metaphorical extensions of the verbs tirare and trascinare and the specific 
sensory-motor imagery associated with these predicates. All the metaphorical items 
discussed above (19-22) are built upon an operation of conceptual mapping in which: 


a. The agent corresponds to the agent that leads the motion; 

b. The party affected by the new situation or process corresponds to the entity 
(animate or inanimate) moved by the agent along the path; 

c. The caused change of state or situation corresponds to the motion caused by the 
agent; 

d. The achievement of the final goal corresponds to reaching the final location 
along the path. 
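
The correspondences in (a)-(d) can be read as a small role-to-role lookup between the source domain (caused joint motion) and the target domain (caused change of state). The following Python sketch is offered only as an illustration of that reading: the role labels, the `interpret` function, and the role fillers assigned to example (22) are assumptions introduced here, not part of the original annotation.

```python
# Illustrative sketch of the mapping in (a)-(d); all identifiers are hypothetical.
# Keys are roles in the source domain (caused joint motion); values are the
# corresponding roles in the target domain (caused change of state).
SOURCE_TO_TARGET = {
    "leading_agent": "causing_agent",    # (a)
    "moved_entity": "affected_party",    # (b)
    "caused_motion": "caused_change",    # (c)
    "final_location": "final_goal",      # (d)
}


def interpret(motion_scene):
    """Relabel the fillers of a motion scene with their target-domain roles.

    Roles that the mapping does not cover are simply dropped.
    """
    return {
        SOURCE_TO_TARGET[role]: filler
        for role, filler in motion_scene.items()
        if role in SOURCE_TO_TARGET
    }


# Example (22), with role fillers assigned by hand for illustration.
example_22 = {
    "leading_agent": "il presidente",
    "moved_entity": "il paese",
    "caused_motion": "trascinare",
    "final_location": "il fondo",
}

print(interpret(example_22))
```

A flat dictionary is enough here because the mapping described in (a)-(d) is one-to-one; a richer encoding would be needed to capture the image schemas attached to each role.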


6.2.1 The Marked Variation of the Verb Trascinare 


The metaphorical variation of the verb trascinare (Eng., to drag) diverges from that 
of tirare on many points. The systematic combination of the FORCE and MOTION 
schemas is the thread that connects the different sets of metaphorical 
uses produced by the verb. Nevertheless, both the MOTION and the FORCE schemas 
(and imageries) associated with the predicate are richer and more complex than 
those involved in the variation of tirare, as they seem to be more semantically 
constrained. Unlike tirare, the verb trascinare encodes not only the schema 
of CAUSED MOTION but also that of SELF-PROPELLED MOTION. The verb 
also requires a specific manner of motion (frictional,⁹ forceful, and difficult). With 
regard to the FORCE schema, the verb trascinare requires that the target entity be 
reluctant or difficult to move (BLOCKAGE schema) and that the force moving the target 
entity (COMPULSION FORCE schema) continuously try to overcome that physical 
restraint (RESTRAINT REMOVAL schema). The metaphorical items identified within 
the variation of trascinare confirm the salience of all the semantic aspects discussed 
above (see also Sect. 5.2). The CAUSED MOTION image schema seems to play a 
structural role in the modeling of many metaphorical uses: 


8 The change of state can be enriched with additional spatial information and represented as a motion 
performed along a bounded path (CONTAINER schema) or along the vertical axis (VERTICAL AXIS 
schema). 

9 The verb trascinare (Eng., to drag) always implies a sort of friction between the moving entity 
and the ground along which the entity moves. 

(23) “Gli eventi trascinano la massa” 
“The events are dragging people [along]” 
(24) “L’odio ti trascina” 
“Hatred tugs on you” 
(25) “L'attore trascina il pubblico” 
“The actor drags the audience along” 
(26) “Il tifo non trascina nessuno” 
“The cheer does not grab [lit., drag] anyone” 


Sentences (23-26) profile an extremely unbalanced system of forces, in which one 
entity (an agent, external event, process, or emotion) is conceptualized in terms of 
volition and animacy, and impinges on a second entity’s behavior, state, or action. The 
general conceptual metaphors to which we can relate these examples are the same as 
those cited in the previous section (e.g., CAUSATION IS AGENTIVE CAUSATION, CAUSES 
ARE FORCES, and CAUSATION IS CONTROL OVER AN ENTITY RELATIVE TO A 
LOCATION). What is especially interesting in (23-26) is that the verb tirare 
cannot be applied in these same metaphorical contexts to express the same kind of 
metaphorical meaning. The kind of force encoded by tirare does not entail 
the same state of imbalance (in the ratio between the entities and the 
forces involved) that seems to be a salient feature underlying all the metaphorical 
extensions expressed by trascinare. Unlike tirare, the verb trascinare always entails 
the existence of a sort of impediment to motion and, hence, the presence of a specific 
bodily response to that impediment: the verb trascinare entails that the motion 
(and the action) is performed with difficulty and that the difficulty increases the effort 
needed to accomplish an objective or to reach a goal (e.g., conceptual metaphor 
DIFFICULTIES ARE IMPEDIMENTS TO MOVEMENT).¹⁰ For the same reason, the verb 
trascinare is mainly used to encode metaphors that imply a slightly negative meaning. 
The characteristics discussed so far also seem to be relevant to the metaphorical 
extensions of the verb trascinare that rely upon a different type of motion schema, that 
is, the SELF-PROPELLED MOTION schema. In the case of self-propelled motion, instead of 
being affected by an external force, an entity moves spontaneously in its own direction: 


(27) “Il conflitto si trascina da anni” 
“The conflict has been dragging on for years” 
(28) “Gianni si trascina in un’esistenza spaventosa” 
“Gianni is dragging himself into an awful existence” 


Examples (27-28) have different meanings and refer to different abstract concepts, 
but both can be linked to the primary conceptual metaphor SELF-PROPELLED ACTION 
IS SELF-PROPELLED MOTION. While in the first sentence (27) the moving entity is 
represented by a long-lasting event (e.g., TIME IS A LANDSCAPE IN WHICH EVENTS 
MOVE THROUGH),¹¹ in the second example (28) the moving entity is represented by a 
person, a volitional and animate entity, who laboriously drags herself through a painful 
and difficult situation. Interestingly, in (27-28), the verb tirare cannot be applied since 
its semantic core does not encode the schema of SELF-PROPELLED 
MOTION. By contrast, in these sentences the verb trascinare is perfectly usable, 
since it also encodes the SELF-MOTION schema in its primary variation. 

10 For the same reason, the verb trascinare (Eng., to drag) is mainly used to encode metaphors that 
imply a slightly negative meaning (see plus-minus parameter in Krzeszowski 1993). 


6.2.2 The Marked Variation of the Verb Tirare 


As we saw in examples (19-20), the verb tirare (Eng., to pull) is mostly used to 
encode causation events, that is, to profile metaphorical scenarios in which one 
entity causes another entity to be affected by the occurrence of a new event or state 
(e.g., CONTROL OVER ACTION IS CONTROL OVER MOTION, CAUSED CHANGE OF 
STATE (or ACTION) IS CAUSED CHANGE OF MOTION, etc.). Interestingly, this verb 
often encodes causation events which entail a specific spatial relationship between 
the agentive force and the entity affected by the force: 


(29) “Non hai speranze di tirarmi dalla tua parte” 
“You cannot get me on your side” 

(30) “Sandra tira sempre” 
“Sandra is attractive” 


The metaphors in (29-30) show that the PATH schema involved in the semantic core 
of tirare entails that the end point of the shift from point A (START POINT FOCUS 
schema) to point B (END POINT FOCUS schema), performed by the entity affected by the 
force, corresponds to the spatial location of the source of the force. The verb implies that 
the motion is directed towards the actor, that is, towards the source of the force 
(TOWARDS TO schema; NEAR FAR schema). More specifically, example (29) is 
a subcase of the metaphorical structure AGREEMENT IS BEING ON THE SAME SIDE 
(or AGREEMENT IS PROXIMITY), in which physical closeness is the source domain 
for metaphors of similarity, solidarity, and support. Example (30) may be seen 
as a linguistic extension of the conceptual metaphors DESIRES THAT CONTROL 
ACTION ARE EXTERNAL FORCES THAT CONTROL MOTION¹² and DESIRES ARE 
FORCES BETWEEN THE DESIRED AND THE DESIRER. Sexual attraction 
is thereby interpreted as a force toward physical proximity or closeness (e.g., ATTRACTION 
FORCE schema), and the desired object is interpreted as a desired state or location. 
The verb trascinare cannot be applied in similar metaphorical contexts, for two 
main reasons: first, its action-motion schema presupposes that both the agent and 
the affected entity are in motion (CAUSED JOINT MOTION schema); and second, 
even though they move in the same direction (the agent’s direction), the final point 
reached by the affected entity does not correspond to the agent’s location and does 
not result in a shortening of the distance between the entities (TOWARDS TO 
schema; NEAR FAR schema). 

11 Interestingly, when the moving entity is represented by an inanimate entity, the verb trascinare 
always encodes figurative meanings in which the duration of a process (event or situation) is 
measured in terms of motion along a path. 

12 This metaphor could also be associated with example (25). Nevertheless, the verb trascinare does 
not bring along the same kind of inferential structure as tirare and does not entail that the attraction 
force between the agent and the target entity results in a different spatial configuration between the 
two. 

The verb tirare also seems to be pervasively used in the 
encoding of orientational metaphors, that is, metaphors whose mapping organizes 
target concepts by means of very basic spatial vectors, such as up-down, near-far, 
in-out, center-periphery, and so on: 


(31) “L’insegnante tira su il voto di Luca” 
“The teacher raises Luca’s grade”¹³ 
(32) “Ho provato a tirarlo su” 
“I tried to cheer him up” 


In example (31), the PATH schema is conceived as a SCALE, i.e., as a vertical path 
whose points are not neutral points but values. The example profiles a scenario in 
which an actor (ANIMATE ENTITY schema) causes an entity (OBJECT schema) to change 
position on a scale. The change of position from one point (START POINT FOCUS schema) 
to another (END POINT FOCUS schema) results in a change of state of the object 
(here conceived as a value). The metaphorical extension in (31) can be interpreted as 
a lexical representation of the metaphor CAUSE INCREASE IN QUANTITY IS CAUSE 
UPWARD MOTION, which is a special case of the more general, primary conceptual 
metaphor MORE IS UP. Finally, example (32) represents a scenario in which 
the passage from a negative to a positive emotional state is conceptualized in terms 
of upward motion caused by an external force or entity. The expression is a 
case of the conceptual metaphor CAUSE CHANGE IN MOOD IS VERTICAL MOTION, 
which is a subcase of the primary metaphor HAPPY IS UP (or IMPROVEMENT IN 
MOOD IS UPWARD MOTION). 


7 Discussion of the Results 


This work focused on the semantic description of four action verbs encoding force, 
i.e., premere (Eng., to press), spingere (Eng., to push), tirare (Eng., to pull), and 
trascinare (Eng., to drag). The analysis was organized so as to compare 
two pairings of verbs simultaneously: on the one hand, similarities and differences between 
the verbs premere and spingere were presented; on the other hand, convergences and 
divergences between tirare and trascinare were explained. 


13 The action verb trascinare (Eng., to drag) cannot be applied to encode the metaphorical increase 
(or decrease) of a value along an imaginary VERTICAL AXIS (SCALE schema). This predicate can 
only be used to encode force-motion events along the horizontal axis. The kind of force encoded 
by trascinare, in fact, presupposes that the gravitational steady state of the entities involved in the 
event does not change. The entities must move along the ground (or horizontal path), producing a 
continuous frictional force. 

With regard to the primary variation, it has been shown that the action verb premere 
(Eng., to press) only applies to contexts in which the theme affected by 
the force does not undergo any form of motion (BLOCKAGE schema). Additionally, 
it has been stressed that premere focuses on the pure exertion of force (in the form 
of physical pressure), i.e., on the interaction between the entity that applies the force 
and the object towards which that force is directed. Unlike premere, the other 
three verbs encode the MOTION schema within their inner semantic skeleton and are 
thus used to profile more kinetic action scenes. Both spingere (Eng., to 
push) and tirare (Eng., to pull) have very flexible semantics, being able to encode 
different types of action events (with or without the association of force and motion). 
Nevertheless, they mainly focus on the result of the forceful interaction between the 
entities involved in the action, that is, on the directed caused motion to which the 
object is subject. In spingere, the motion is normally thought to be directed from 
the point of contact between the effector and the object and away from the agent; in 
tirare, the motion is normally thought to be directed from the point of contact between 
the effector and the object and towards the agent. Finally, the verb trascinare (Eng., to 
drag) represents a very specific case, as it requires a greater number of necessary 
components for its application and always needs the CAUSED JOINT MOTION and 
the RESTRAINT REMOVAL schemas to be activated. As a matter of fact, in trascinare 
the application of force is always associated with the MOTION schema, 
and it is, in some way, limited by the fact that the object has weight and may be 
reluctant to move. These two facts represent a restraint that must be 
constantly overcome in order to move the object along the surface it lies upon. 

This study aimed to show not only how the semantics of action words mirrors 
the way in which we internally structure the logic of metaphorical concepts, but 
also, as stressed throughout the analysis, that the differential semantic traits 
characterizing the four predicates strictly influence their metaphorical potential. When 
their semantic networks converge, it is easier to detect the reasons why these predicates 
can be equally applied to express the same figurative meanings. Conversely, 
when their semantic extensions start to diverge, we may wonder how it is possible 
that some metaphorical concepts can be accessed by one verb and not by the other. 
On the basis of the data, I suggest that for a metaphor to be expressed in a specific 
context, the predicate must contain the specific schemas pertaining to that context. With 
respect to the semantics of these four action verbs: 


(a) Metaphors involving the target domains of PSYCHOLOGICAL MANIPULATION, 
INFLUENCE or IMPACT are enabled by the presence of the FORCE (exerted in the 
form of a pressure) image schema and have been encountered in the variation 
of the predicates premere and spingere; 

(b) Metaphors encoding CAUSATION are enabled by the combination of the FORCE 
and MOTION image schemas, and have been detected only along the semantic 
variation of spingere, tirare and trascinare; 

(c) Metaphors encoding self-propelled changes of state or action are enabled by the 
combination of the FORCE and SELF-PROPELLED MOTION image schemas, and 
have been only found in the metaphorical production of spingere and trascinare; 

(d) Orientational metaphors are enabled by the presence of the VERTICAL ORIENTATION 
image schema and have been identified only in the annotation of 
spingere and tirare, which happen to be less spatially constrained than, say, 
trascinare. 


The following table schematizes the relationship between the verbs and their 
metaphorical potential (Table 4). 


Table 4 Metaphorical potential of the four action verbs 

Metaphor type                         | Image schemas                       | Premere | Spingere | Tirare | Trascinare 
Manipulation, influence, impact       | Force                               | +       | +        | -      | -          
Causation                             | Force, caused motion                | -       | +        | +      | +          
Self-caused change of state or action | Force, self-propelled motion        | -       | +        | -      | +          
Spatial orientation                   | Force, motion, vertical orientation | -       | +        | +      | -          
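
The suggestion that a predicate must contain the schemas a metaphor type requires can be sketched as a subset check. The Python below is a minimal illustration under stated assumptions, not the author's formalization: the schema inventories are assigned by hand so as to reflect generalizations (a)-(d), the label `FORCE_AS_PRESSURE` is a hypothetical shorthand for "FORCE exerted in the form of a pressure", and the function names are invented for this example.

```python
# Hand-assigned schema inventories per verb (an assumption for illustration,
# chosen to reflect generalizations (a)-(d); not taken from IMAGACT).
VERB_SCHEMAS = {
    "premere": {"FORCE", "FORCE_AS_PRESSURE"},
    "spingere": {
        "FORCE", "FORCE_AS_PRESSURE", "CAUSED_MOTION",
        "SELF_PROPELLED_MOTION", "VERTICAL_ORIENTATION",
    },
    "tirare": {"FORCE", "CAUSED_MOTION", "VERTICAL_ORIENTATION"},
    "trascinare": {"FORCE", "CAUSED_MOTION", "SELF_PROPELLED_MOTION"},
}

# Schemas each metaphor type is assumed to require (hypothetical labels).
REQUIRED_SCHEMAS = {
    "manipulation": {"FORCE_AS_PRESSURE"},
    "causation": {"FORCE", "CAUSED_MOTION"},
    "self_caused_change": {"FORCE", "SELF_PROPELLED_MOTION"},
    "spatial_orientation": {"FORCE", "CAUSED_MOTION", "VERTICAL_ORIENTATION"},
}


def licenses(verb, metaphor_type):
    """Return True if the verb's inventory contains every required schema."""
    return REQUIRED_SCHEMAS[metaphor_type] <= VERB_SCHEMAS[verb]


def potential_table():
    """Build a metaphor-type by verb table of licensing judgments."""
    return {
        metaphor_type: {verb: licenses(verb, metaphor_type) for verb in VERB_SCHEMAS}
        for metaphor_type in REQUIRED_SCHEMAS
    }


for metaphor_type, row in potential_table().items():
    marks = "  ".join(f"{verb}:{'+' if ok else '-'}" for verb, ok in row.items())
    print(f"{metaphor_type:20s} {marks}")
```

Because the inventories were chosen to match (a)-(d), the check is circular as evidence; its only purpose is to make explicit that each judgment follows from schema containment rather than from a verb-by-verb stipulation.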


8 Conclusions 


The data extracted from the semantic variation of the verbs premere (Eng., to press), 
spingere (Eng., to push), tirare (Eng., to pull) and trascinare (Eng., to drag) suggest 
that the metaphorical extensions of these action verbs are not randomly produced 
but are the result of metaphorical processes in which sensory-motor information and 
specific image-schematic features are transferred from one domain to another, to 
enable the representation of highly abstract concepts. In particular, it was shown that 
the differential semantic properties (and image-schematic structures) characterizing the 
verbs strictly impinge on their metaphorical potential, determining, in some way, the 
type of metaphorical items that may or may not be expressed (Lakoff 1990, 1993; 
Turner 1991). The analysis also shows that the same differential semantic properties 
(and image-schematic structures) are responsible for the type of partial 
equivalence that can be established between the action verbs (e.g., premere and spingere), 
whether their primary or their marked variations are considered. In this sense, the 
investigation of the action verbs’ semantics contributes to a better understanding of 
the way we use action information and very basic bodily schemas to shape not only 
the way we think but also the way we talk. Action verbs constitute essential linguistic 
anchors between sensory-motor experience and abstract knowledge, and their deeper 
semantic description may serve a number of different goals, especially 
the building and structuring of linguistic resources and ontologies. Even in the 
IMAGACT ontology, a more articulated characterization of the action lexicon may be 
used to improve the representation of verbs’ senses, and to systematically define the 
linguistic boundaries between sense extensions of similar action verbs (e.g., locally 
equivalent verbs). Finally, the image-schematic approach may be a useful tool in 
the representation of the metaphorical network activated by each action verb stored 
in the Ontology. At a more general level, I believe that the 
current results make their main contribution to the field of Cognitive Linguistics 
and semantic studies. 